George Nicholson, Marta Blangiardo, Mark Briers, Peter J Diggle, Tor Erlend
Fjelde, Hong Ge, Robert J B Goudie, Radka Jersakova, Ruairidh E King, Brieuc
C L Lehmann, Ann-Marie Mallon, Tullia Padellini, Yee Whye Teh, Chris Holmes,
and Sylvia Richardson.
**Interoperability of
statistical models in pandemic preparedness: principles and reality**.
*Stat. Sci.*, 37(2):183-206, May 2022.

**Abstract:** We
present interoperability as a guiding framework for statistical modelling to
assist policy makers asking multiple questions using diverse datasets in the
face of an evolving pandemic response. Interoperability provides an important
set of principles for future pandemic preparedness, through the joint design
and deployment of adaptable systems of statistical models for disease
surveillance using probabilistic reasoning. We illustrate this through case
studies for inferring and characterising spatial-temporal prevalence and
reproduction numbers of SARS-CoV-2 infections in England.

Javier Antorán, Umang Bhatt, Tameem Adel, Adrian Weller, and José Miguel
Hernández-Lobato.
**Getting a CLUE: A
method for explaining uncertainty estimates**.
In *9th International Conference on Learning Representations*, April
2021.

**Abstract:** Both uncertainty estimation and
interpretability are important factors for trustworthy machine learning
systems. However, there is little work at the intersection of these two
areas. We address this gap by proposing a novel method for interpreting
uncertainty estimates from differentiable probabilistic models, like Bayesian
Neural Networks (BNNs). Our method, Counterfactual Latent Uncertainty
Explanations (CLUE), indicates how to change an input, while keeping it on
the data manifold, such that a BNN becomes more confident about the input's
prediction. We validate CLUE through 1) a novel framework for evaluating
counterfactual explanations of uncertainty, 2) a series of ablation
experiments, and 3) a user study. Our experiments show that CLUE outperforms
baselines and enables practitioners to better understand which input patterns
are responsible for predictive uncertainty.

Tameem Adel and Adrian Weller.
**TibGM: A
transferable and information-based graphical model approach for reinforcement
learning**.
In *36th International Conference on Machine Learning*, Long Beach, June
2019.

**Abstract:** One of the challenges to reinforcement
learning (RL) is scalable transferability among complex tasks. Incorporating
a graphical model (GM), along with the rich family of related methods, as a
basis for RL frameworks provides potential to address issues such as
transferability, generalisation and exploration. Here we propose a flexible
GM-based RL framework which leverages efficient inference procedures to
enhance generalisation and transfer power. In our proposed transferable and
information-based graphical model framework ‘TibGM’, we show the
equivalence between our mutual information-based objective in the GM, and an
RL consolidated objective consisting of a standard reward maximisation target
and a generalisation/transfer objective. In settings where there is a sparse
or deceptive reward signal, our TibGM framework is flexible enough to
incorporate exploration bonuses depicting intrinsic rewards. We empirically
verify improved performance and exploration power.

Tameem Adel, Isabel Valera, Zoubin Ghahramani, and Adrian Weller.
**One-network
adversarial fairness**.
In *33rd AAAI Conference on Artificial Intelligence*, Hawaii, January
2019.

**Abstract:** There is currently a great expansion of
the impact of machine learning algorithms on our lives, prompting the need
for objectives other than pure performance, including fairness. Fairness here
means that the outcome of an automated decision-making system should not
discriminate between subgroups characterized by sensitive attributes such as
gender or race. Given any existing differentiable classifier, we make only
slight adjustments to the architecture including adding a new hidden layer,
in order to enable the concurrent adversarial optimization for fairness and
accuracy. Our framework provides one way to quantify the tradeoff between
fairness and accuracy, while also leading to strong empirical
performance.

Tameem Adel, Zoubin Ghahramani, and Adrian Weller.
**Discovering
interpretable representations for both deep generative and discriminative
models**.
In *35th International Conference on Machine Learning*, Stockholm,
Sweden, July 2018.

**Abstract:** Interpretability of
representations in both deep generative and discriminative models is highly
desirable. Current methods jointly optimize an objective combining accuracy
and interpretability. However, this may reduce accuracy, and is not
applicable to already trained models. We propose two interpretability
frameworks. First, we provide an interpretable lens for an existing model. We
use a generative model which takes as input the representation in an existing
(generative or discriminative) model, weakly supervised by limited side
information. Applying a flexible and invertible transformation to the input
leads to an interpretable representation with no loss in accuracy. We extend
the approach using an active learning strategy to choose the most useful side
information to obtain, allowing a human to guide what "interpretable" means.
Our second framework relies on joint optimization for a representation which
is both maximally informative about the side information and maximally
compressive about the non-interpretable data factors. This leads to a novel
perspective on the relationship between compression and regularization. We
also propose a new interpretability evaluation metric based on our framework.
Empirically, we achieve state-of-the-art results on three datasets using the
two proposed algorithms.

Javier Antorán, David Janz, James Urquhart Allingham, Erik A. Daxberger,
Riccardo Barbano, Eric T. Nalisnick, and José Miguel Hernández-Lobato.
**Adapting the
linearised Laplace model evidence for modern deep learning**.
In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang
Niu, and Sivan Sabato, editors, *39th International Conference on Machine
Learning*, volume 162 of *Proceedings of Machine Learning
Research*, pages 796-821. PMLR, 2022.

**Abstract:**
The linearised Laplace method for estimating model uncertainty has received
renewed attention in the Bayesian deep learning community. The method
provides reliable error bars and admits a closed-form expression for the
model evidence, allowing for scalable selection of model hyperparameters. In
this work, we examine the assumptions behind this method, particularly in
conjunction with model selection. We show that these interact poorly with
some now-standard tools of deep learning (stochastic approximation methods
and normalisation layers) and make recommendations for how to better adapt
this classic method to the modern setting. We provide theoretical support for
our recommendations and validate them empirically on MLPs, classic CNNs,
residual networks with and without normalisation layers, generative
autoencoders and transformers.

James Urquhart Allingham, Florian Wenzel, Zelda E Mariet, Basil Mustafa, Joan
Puigcerver, Neil Houlsby, Ghassen Jerfel, Vincent Fortuin, Balaji
Lakshminarayanan, Jasper Snoek, Dustin Tran, Carlos Riquelme Ruiz, and
Rodolphe Jenatton.
**Sparse MoEs meet
efficient ensembles**.
*Transactions on Machine Learning Research*, 2022.

**Abstract:** Machine learning models based on the aggregated outputs of
submodels, either at the activation or prediction levels, often exhibit
strong performance compared to individual models. We study the interplay of
two popular classes of such models: ensembles of neural networks and sparse
mixture of experts (sparse MoEs). First, we show that the two approaches have
complementary features whose combination is beneficial. This includes a
comprehensive evaluation of sparse MoEs in uncertainty related benchmarks.
Then, we present efficient ensemble of experts (E³), a scalable
and simple ensemble of sparse MoEs that takes the best of both classes of
models, while using up to 45% fewer FLOPs than a deep ensemble. Extensive
experiments demonstrate the accuracy, log-likelihood, few-shot learning,
robustness, and uncertainty improvements of E³ over several
challenging vision Transformer-based baselines. E³ not only
preserves its efficiency while scaling to models with up to 2.7B parameters,
but also provides better predictive performance and uncertainty estimates for
larger models.

**Comment:** Code

Vincent Fortuin, Mark Collier, Florian Wenzel, James Urquhart Allingham,
Jeremiah Zhe Liu, Dustin Tran, Balaji Lakshminarayanan, Jesse Berent,
Rodolphe Jenatton, and Effrosyni Kokiopoulou.
**Deep classifiers with
label noise modeling and distance awareness**.
*Transactions on Machine Learning Research*, 2022.

**Abstract:** Uncertainty estimation in deep learning has recently emerged as
a crucial area of interest to advance reliability and robustness in
safety-critical applications. While there have been many proposed methods
that either focus on distance-aware model uncertainties for
out-of-distribution detection or on input-dependent label uncertainties for
in-distribution calibration, both of these types of uncertainty are often
necessary. In this work, we propose the HetSNGP method for jointly modeling
the model and data uncertainty. We show that our proposed model affords a
favorable combination between these two types of uncertainty and thus
outperforms the baseline methods on some challenging out-of-distribution
datasets, including CIFAR-100C, ImageNet-C, and ImageNet-A. Moreover, we
propose HetSNGP Ensemble, an ensembled version of our method which
additionally models uncertainty over the network parameters and outperforms
other ensemble baselines.

**Comment:** Code

Erik A. Daxberger, Eric T. Nalisnick, James Urquhart Allingham, Javier
Antorán, and José Miguel Hernández-Lobato.
**Bayesian deep
learning via subnetwork inference**.
In Marina Meila and Tong Zhang, editors, *38th International Conference on
Machine Learning*, volume 139 of *Proceedings of Machine Learning
Research*, pages 2510-2521. PMLR, 2021.

**Abstract:**
The Bayesian paradigm has the potential to solve core issues of deep neural
networks such as poor calibration and data inefficiency. Alas, scaling
Bayesian inference to large weight spaces often requires restrictive
approximations. In this work, we show that it suffices to perform inference
over a small subset of model weights in order to obtain accurate predictive
posteriors. The other weights are kept as point estimates. This subnetwork
inference framework enables us to use expressive, otherwise intractable,
posterior approximations over such subsets. In particular, we implement
subnetwork linearized Laplace: We first obtain a MAP estimate of all weights
and then infer a full-covariance Gaussian posterior over a subnetwork. We
propose a subnetwork selection strategy that aims to maximally preserve the
model's predictive uncertainty. Empirically, our approach is effective
compared to ensembles and less expressive posterior approximations over full
networks.

Chelsea Murray, James Urquhart Allingham, Javier Antorán, and José Miguel
Hernández-Lobato.
**Addressing bias
in active learning with depth uncertainty networks... or not**.
In Melanie F. Pradier, Aaron Schein, Stephanie L. Hyland, Francisco J. R. Ruiz,
and Jessica Zosa Forde, editors, *I (Still) Can't Believe It's Not Better!
Workshop at NeurIPS 2021, Virtual Workshop, December 13, 2021*, volume
163 of *Proceedings of Machine Learning Research*, pages 59-63.
PMLR, 2021.

**Abstract:** Farquhar et al. [2021] show that
correcting for active learning bias with underparameterised models leads to
improved downstream performance. For overparameterised models such as NNs,
however, correction leads either to decreased or unchanged performance. They
suggest that this is due to an "overfitting bias" which offsets the active
learning bias. We show that depth uncertainty networks operate in a low
overfitting regime, much like underparameterised models. They should
therefore see an increase in performance with bias correction. Surprisingly,
they do not. We propose that this negative result, as well as the results of
Farquhar et al. [2021], can be explained via the lens of the bias-variance
decomposition of generalisation error.

Javier Antorán, James Urquhart Allingham, and José Miguel
Hernández-Lobato.
**Depth
uncertainty in neural networks**.
In Hugo Larochelle, Marc'Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan,
and Hsuan-Tien Lin, editors, *Advances in Neural Information Processing
Systems 33*, 2020.

**Abstract:** Existing methods for
estimating uncertainty in deep learning tend to require multiple forward
passes, making them unsuitable for applications where computational resources
are limited. To solve this, we perform probabilistic reasoning over the depth
of neural networks. Different depths correspond to subnetworks which share
weights and whose predictions are combined via marginalisation, yielding
model uncertainty. By exploiting the sequential structure of feed-forward
networks, we are able to both evaluate our training objective and make
predictions with a single forward pass. We validate our approach on
real-world regression and image classification tasks. Our approach provides
uncertainty calibration, robustness to dataset shift, and accuracies
competitive with more computationally expensive baselines.

**Comment:** Code

Umang Bhatt, Javier Antorán, Yunfeng Zhang, Q Vera Liao, Prasanna
Sattigeri, Riccardo Fogliato, Gabrielle Melançon, Ranganath Krishnan,
Jason Stanley, Omesh Tickoo, et al.
**Uncertainty as
a form of transparency: Measuring, communicating, and using
uncertainty**.
In *4th AAAI/ACM Conference on Artificial Intelligence, Ethics and
Society*, 2021.

**Abstract:** Algorithmic transparency
entails exposing system properties to various stakeholders for purposes that
include understanding, improving, and contesting predictions. Until now, most
research into algorithmic transparency has predominantly focused on
explainability. Explainability attempts to provide reasons for a machine
learning model's behavior to stakeholders. However, understanding a model's
specific behavior alone might not be enough for stakeholders to gauge whether
the model is wrong or lacks sufficient knowledge to solve the task at hand.
In this paper, we argue for considering a complementary form of transparency
by estimating and communicating the uncertainty associated with model
predictions. First, we discuss methods for assessing uncertainty. Then, we
characterize how uncertainty can be used to mitigate model unfairness,
augment decision-making, and build trustworthy systems. Finally, we outline
methods for displaying uncertainty to stakeholders and recommend how to
collect information required for incorporating uncertainty into existing ML
pipelines. This work constitutes an interdisciplinary review drawn from
literature spanning machine learning, visualization/HCI, design,
decision-making, and fairness. We aim to encourage researchers and
practitioners to measure, communicate, and use uncertainty as a form of
transparency.

Matthew Ashman, Thang D. Bui, Cuong V. Nguyen, Efstratios Markou, Adrian
Weller, Siddharth Swaroop, and Richard E. Turner.
**Partitioned variational inference:
A framework for probabilistic federated learning**.
2022.

**Abstract:** The proliferation of computing devices has
brought about an opportunity to deploy machine learning models on new problem
domains using previously inaccessible data. Traditional algorithms for
training such models often require data to be stored on a single machine with
compute performed by a single node, making them unsuitable for decentralised
training on multiple devices. This deficiency has motivated the development
of federated learning algorithms, which allow multiple data owners to train
collaboratively and use a shared model whilst keeping local data private.
However, many of these algorithms focus on obtaining point estimates of model
parameters, rather than probabilistic estimates capable of capturing model
uncertainty, which is essential in many applications. Variational inference
(VI) has become the method of choice for fitting many modern probabilistic
models. In this paper we introduce partitioned variational inference (PVI), a
general framework for performing VI in the federated setting. We develop new
supporting theory for PVI, demonstrating a number of properties that make it
an attractive choice for practitioners; use PVI to unify a wealth of
fragmented, yet related literature; and provide empirical results that
showcase the effectiveness of PVI in a variety of federated settings.

Metod Jazbec, Matt Ashman, Vincent Fortuin, Michael Pearce, Stephan Mandt, and
Gunnar Rätsch.
**Scalable
Gaussian process variational autoencoders**.
In Arindam Banerjee and Kenji Fukumizu, editors, *Proceedings of The 24th
International Conference on Artificial Intelligence and Statistics*,
volume 130 of *Proceedings of Machine Learning Research*, pages
3511-3519. PMLR, 13-15 Apr 2021.

**Abstract:** Conventional variational autoencoders fail in
modeling correlations between data points due to their use of factorized
priors. Amortized Gaussian process inference through GP-VAEs has led to
significant improvements in this regard, but is still inhibited by the
intrinsic complexity of exact GP inference. We improve the scalability of
these methods through principled sparse inference approaches. We propose a
new scalable GP-VAE model that outperforms existing approaches in terms of
runtime and memory footprint, is easy to implement, and allows for joint
end-to-end optimization of all components.

Matthew Ashman, Jonny So, Will Tebbutt, Vincent Fortuin, Michael Pearce, and
Richard E. Turner.
**Sparse Gaussian process
variational autoencoders**.
2020.

**Abstract:** Large, multi-dimensional spatio-temporal
datasets are omnipresent in modern science and engineering. An effective
framework for handling such data is the Gaussian process deep generative model
(GP-DGM), which employs GP priors over the latent variables of a DGM. Existing
approaches for performing inference in GP-DGMs do not support sparse GP
approximations based on inducing points, which are essential for the
computational efficiency of GPs, nor do they handle missing data - a natural
occurrence in many spatio-temporal datasets - in a principled manner. We
address these shortcomings with the development of the sparse Gaussian
process variational autoencoder (SGP-VAE), characterised by the use of
partial inference networks for parameterising sparse GP approximations.
Leveraging the benefits of amortised variational inference, the SGP-VAE
enables inference in multi-output sparse GPs on previously unobserved data
with no additional training. The SGP-VAE is evaluated in a variety of
experiments where it outperforms alternative approaches including
multi-output GPs and structured VAEs.

Matej Balog.
**Converting to
Optimization in Machine Learning: Perturb-and-MAP, Differential Privacy,
and Program Synthesis**.
PhD thesis, University of Cambridge, Department of Engineering, Cambridge, UK,
2020.

**Abstract:** On a mathematical level, most
computational problems encountered in machine learning are instances of one
of four abstract, fundamental problems: sampling, integration, optimization,
and search. Thanks to the rich history of the respective mathematical fields,
disparate methods with different properties have been developed for these
four problem classes. As a result it can be beneficial to convert a problem
from one abstract class into a problem of a different class, because the
latter might come with insights, techniques, and algorithms well suited to
the particular problem at hand. In particular, this thesis contributes four
new methods and generalizations of existing methods for converting specific
non-optimization machine learning tasks into optimization problems with more
appealing properties. The first example is partition function estimation (an
integration problem), where an existing algorithm - the Gumbel trick - for
converting to the MAP optimization problem is generalized into a more general
family of algorithms, such that other instances of this family have better
statistical properties. Second, this family of algorithms is further
generalized to another integration problem, the problem of estimating Rényi
entropies. The third example shows how an intractable sampling problem
arising when wishing to publicly release a database containing sensitive data
in a safe ("differentially private") manner can be converted into an
optimization problem using the theory of Reproducing Kernel Hilbert Spaces.
Finally, the fourth case study casts the challenging discrete search problem
of program synthesis from input-output examples as a supervised learning task
that can be efficiently tackled using gradient-based optimization. In all
four instances, the conversions result in novel algorithms with desirable
properties. In the first instance, new generalizations of the Gumbel trick
can be used to construct statistical estimators of the partition function
that achieve the same estimation error while using up to 40% fewer samples.
The second instance shows that unbiased estimators of the Rényi entropy can
be constructed in the Perturb-and-MAP framework. The main contribution of the
third instance is theoretical: the conversion shows that it is possible to
construct an algorithm for releasing synthetic databases that approximate
databases containing sensitive data in a mathematically precise sense, and to
prove results about their approximation errors. Finally, the fourth
conversion yields an algorithm for synthesising program source code from
input-output examples that is able to solve test problems 1-3 orders of
magnitude faster than a wide range of baselines.

Matej Balog, Rishabh Singh, Petros Maniatis, and Charles Sutton.
**Neural program synthesis with a
differentiable fixer**.
*arXiv*, 2020.

**Abstract:** We present a new program
synthesis approach that combines an encoder-decoder based synthesis
architecture with a differentiable program fixer. Our approach is inspired
by the fact that human developers seldom get their program correct on the
first attempt, and perform iterative testing-based program fixing to get to
the desired program functionality. Similarly, our approach first learns a
distribution over programs conditioned on an encoding of a set of
input-output examples, and then iteratively performs fix operations using the
differentiable fixer. The fixer takes as input the original examples and the
current program's outputs on example inputs, and generates a new distribution
over the programs with the goal of reducing the discrepancies between the
current program outputs and the desired example outputs. We train our
architecture end-to-end on the RobustFill domain, and show that the addition
of the fixer module leads to a significant improvement on synthesis accuracy
compared to using beam search.

Matej Balog, Bart van Merriënboer, Subhodeep Moitra, Yujia Li, and Daniel
Tarlow.
**Fast training of sparse graph
neural networks on dense hardware**.
*arXiv*, 2019.

**Abstract:** Graph neural networks have
become increasingly popular in recent years due to their ability to naturally
encode relational input data and their ability to scale to large graphs by
operating on a sparse representation of graph adjacency matrices. As we look
to scale up these models using custom hardware, a natural assumption would be
that we need hardware tailored to sparse operations and/or dynamic control
flow. In this work, we question this assumption by scaling up sparse graph
neural networks using a platform targeted at dense computation on fixed-size
data. Drawing inspiration from optimization of numerical algorithms on sparse
matrices, we develop techniques that enable training the sparse graph neural
network model from Allamanis et al. [2018] in 13 minutes using a 512-core
TPUv2 Pod, whereas the original training takes almost a day.

Matej Balog, Ilya Tolstikhin, and Bernhard Schölkopf.
**Differentially
private database release via kernel mean embeddings**.
In *35th International Conference on Machine Learning*, Stockholm,
Sweden, July 2018.

**Abstract:** We lay theoretical
foundations for new database release mechanisms that allow third-parties to
construct consistent estimators of population statistics, while ensuring that
the privacy of each individual contributing to the database is protected. The
proposed framework rests on two main ideas. First, releasing (an estimate of)
the kernel mean embedding of the data generating random variable instead of
the database itself still allows third-parties to construct consistent
estimators of a wide class of population statistics. Second, the algorithm
can satisfy the definition of differential privacy by basing the released
kernel mean embedding on entirely synthetic data points, while controlling
accuracy through the metric available in a Reproducing Kernel Hilbert Space.
We describe two instantiations of the proposed framework, suitable under
different scenarios, and prove theoretical results guaranteeing differential
privacy of the resulting algorithms and the consistency of estimators
constructed from their outputs.

**Comment:** [arXiv]

Matej Balog, Nilesh Tripuraneni, Zoubin Ghahramani, and Adrian Weller.
**Lost
relatives of the Gumbel trick**.
In *34th International Conference on Machine Learning*, Sydney,
Australia, August 2017.

** Abstract:** The Gumbel trick is a
method to sample from a discrete probability distribution, or to estimate its
normalizing partition function. The method relies on repeatedly applying a
random perturbation to the distribution in a particular way, each time
solving for the most likely configuration. We derive an entire family of
related methods, of which the Gumbel trick is one member, and show that the
new methods have superior properties in several settings with minimal
additional computational cost. In particular, for the Gumbel trick to yield
computational benefits for discrete graphical models, Gumbel perturbations on
all configurations are typically replaced with so-called low-rank
perturbations. We show how a subfamily of our new methods adapts to this
setting, proving new upper and lower bounds on the log partition function and
deriving a family of sequential samplers for the Gibbs distribution. Finally,
we balance the discussion by showing how the simpler analytical form of the
Gumbel trick enables additional theoretical results.

** Comment:** [arXiv] [Poster]
[Slides]
[Code]
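
As a rough illustration of the basic Gumbel trick summarised in this abstract (not the authors' released code, and not the new family of methods the paper derives), the sketch below perturbs every configuration's potential with independent standard Gumbel noise and maximises. The argmax is a sample from the Gibbs distribution, and the average maximum minus the Euler-Mascheroni constant estimates the log partition function. The function names are hypothetical, and full enumeration of configurations is assumed, so low-rank perturbations are not covered.

```python
import math
import random

# Mean of a standard Gumbel(0, 1) variable (Euler-Mascheroni constant).
EULER_GAMMA = 0.5772156649015329


def standard_gumbel(rng):
    """Draw one Gumbel(0, 1) sample via the inverse CDF: -log(-log(U))."""
    u = rng.random()
    while u == 0.0:  # random() lies in [0, 1); skip 0 to avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))


def perturb_and_max(potentials, rng):
    """Add independent Gumbel noise to each potential and maximise.

    Returns the index of the maximising configuration and the maximum value.
    """
    perturbed = [phi + standard_gumbel(rng) for phi in potentials]
    best = max(range(len(perturbed)), key=perturbed.__getitem__)
    return best, perturbed[best]


def gumbel_sample(potentials, rng):
    """Sample an index with probability proportional to exp(potential)."""
    return perturb_and_max(potentials, rng)[0]


def estimate_log_partition(potentials, num_samples, rng):
    """Estimate log Z as the mean perturbed maximum minus EULER_GAMMA."""
    total = sum(perturb_and_max(potentials, rng)[1] for _ in range(num_samples))
    return total / num_samples - EULER_GAMMA
```

For a small set of potentials, `estimate_log_partition(potentials, 20000, random.Random(0))` can be compared against the exact value `math.log(sum(math.exp(p) for p in potentials))`.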

Matej Balog, Alexander L. Gaunt, Marc Brockschmidt, Sebastian Nowozin, and
Daniel Tarlow.
**DeepCoder:
Learning to write programs**.
In *5th International Conference on Learning Representations*, Toulon,
France, April 2017.

** Abstract:** We develop a first line of
attack for solving programming competition-style problems from input-output
examples using deep learning. The approach is to train a neural network to
predict properties of the program that generated the outputs from the inputs.
We use the neural network's predictions to augment search techniques from the
programming languages community, including enumerative search and an
SMT-based solver. Empirically, we show that our approach leads to an order of
magnitude speedup over the strong non-augmented baselines and a Recurrent
Neural Network approach, and that we are able to solve problems of difficulty
comparable to the simplest problems on programming competition websites.

Matej Balog, Balaji Lakshminarayanan, Zoubin Ghahramani, Daniel M. Roy, and
Yee Whye Teh.
**The
Mondrian kernel**.
In *32nd Conference on Uncertainty in Artificial Intelligence*, pages
32-41, Jersey City, New Jersey, USA, June 2016.

** Abstract:** We introduce the Mondrian kernel, a fast random feature
approximation to the Laplace kernel. It is suitable for both batch and online
learning, and admits a fast kernel-width-selection procedure as the random
features can be re-used efficiently for all kernel widths. The features are
constructed by sampling trees via a Mondrian process [Roy and Teh, 2009], and
we highlight the connection to Mondrian forests [Lakshminarayanan et al.,
2014], where trees are also sampled via a Mondrian process, but fit
independently. This link provides a new insight into the relationship between
kernel methods and random forests.

** Comment:** [Supplementary
Material] [arXiv] [Poster]
[Slides]
[Code]

Jonathan Gordon, John Bronskill, Matthias Bauer, Sebastian Nowozin, and
Richard Turner.
**Meta-learning
probabilistic inference for prediction**.
In *7th International Conference on Learning Representations*, New
Orleans, April 2019.

** Abstract:** This paper introduces a
new framework for data efficient and versatile learning. Specifically: 1) We
develop ML-PIP, a general framework for Meta-Learning approximate
Probabilistic Inference for Prediction. ML-PIP extends existing probabilistic
interpretations of meta-learning to cover a broad class of methods. 2) We
introduce Versa, an instance of the framework employing a flexible and
versatile amortization network that takes few-shot learning datasets as
inputs, with arbitrary numbers of shots, and outputs a distribution over
task-specific parameters in a single forward pass. Versa substitutes
optimization at test time with forward passes through inference networks,
amortizing the cost of inference and relieving the need for second
derivatives during training. 3) We evaluate Versa on benchmark datasets
where the method sets new state-of-the-art results, and can handle arbitrary
numbers of shots, and for classification, arbitrary numbers of classes at
train and test time. The power of the approach is then demonstrated through a
challenging few-shot ShapeNet view reconstruction task.

Matthias Stephan Bauer, Mark van der Wilk, and Carl Edward Rasmussen.
**Understanding
probabilistic sparse Gaussian process approximations**.
In *Advances in Neural Information Processing Systems 29*, 2016.

** Abstract:** Good sparse approximations are essential for
practical inference in Gaussian Processes as the computational cost of exact
methods is prohibitive for large datasets. The Fully Independent Training
Conditional (FITC) and the Variational Free Energy (VFE) approximations are
two recent popular methods. Despite superficial similarities, these
approximations have surprisingly different theoretical properties and behave
differently in practice. We thoroughly investigate the two methods for
regression both analytically and through illustrative examples, and draw
conclusions to guide practical application.

** Comment:** [arXiv]

Varun Babbar, Umang Bhatt, and Adrian Weller.
**On the utility of prediction sets
in human-AI teams**.
In *International Joint Conference on Artificial Intelligence*, 2022.

** Abstract:** Research on human-AI teams usually provides
experts with a single label, which ignores the uncertainty in a model's
recommendation. Conformal prediction (CP) is a well established line of
research that focuses on building a theoretically grounded, calibrated
prediction set, which may contain multiple labels. We explore how such
prediction sets impact expert decision-making in human-AI teams. Our
evaluation on human subjects finds that set-valued predictions positively
impact experts. However, we notice that the predictive sets provided by CP
can be very large, which leads to unhelpful AI assistants. To mitigate this,
we introduce D-CP, a method to perform CP on some examples and defer to
experts. We prove that D-CP can reduce the prediction set size of
non-deferred examples. We show how D-CP performs in quantitative and in human
subject experiments (n=120). Our results suggest that CP prediction sets
improve human-AI team performance over showing the top-1 prediction alone,
and that experts find D-CP prediction sets are more useful than CP prediction
sets.
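
To make the notion of a calibrated prediction set concrete, here is a minimal sketch of generic split conformal prediction for classification. It is an assumption-laden illustration rather than the paper's D-CP method: the nonconformity score (one minus the probability assigned to the true label) and all function names are choices made for this example.

```python
import math


def nonconformity(probs, label):
    """Score a calibration example: 1 minus the probability of its true label."""
    return 1.0 - probs[label]


def conformal_quantile(scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score.

    If that rank exceeds n, the threshold is infinite and every label is kept.
    """
    n = len(scores)
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        return math.inf
    return sorted(scores)[rank - 1]


def prediction_set(probs, threshold):
    """Keep every label whose nonconformity score is within the threshold."""
    return [label for label, p in enumerate(probs) if 1.0 - p <= threshold]
```

With exchangeable calibration and test data, sets built this way contain the true label with probability at least `1 - alpha`. When the model is unsure, several labels pass the threshold and the set grows, which is the behaviour the abstract describes as leading to unhelpfully large sets.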

Katherine M. Collins, Umang Bhatt, and Adrian Weller.
**Eliciting and learning with soft
labels from every annotator**.
In *Proceedings of the AAAI Conference on Human Computation and
Crowdsourcing (HCOMP)*, 2022, doi 10.17863/CAM.87954.

** Abstract:** The labels used to train machine learning (ML)
models are of paramount importance. Typically for ML classification tasks,
datasets contain hard labels, yet learning using soft labels has been shown
to yield benefits for model generalization, robustness, and calibration.
Earlier work found success in forming soft labels from multiple annotators'
hard labels; however, this approach may not converge to the best labels and
necessitates many annotators, which can be expensive and inefficient. We
focus on efficiently eliciting soft labels from individual annotators. We
collect and release a dataset of soft labels (which we call CIFAR-10S) over
the CIFAR-10 test set via a crowdsourcing study (N=248). We demonstrate that
learning with our labels achieves comparable model performance to prior
approaches while requiring far fewer annotators - albeit with significant
temporal costs per elicitation. Our elicitation methodology therefore shows
nuanced promise in enabling practitioners to enjoy the benefits of improved
model performance and reliability with fewer annotators, and serves as a
guide for future dataset curators on the benefits of leveraging richer
information, such as categorical uncertainty, from individual annotators.

** Comment:** [Project
Page] [Data]
[Code]

Dan Ley, Umang Bhatt, and Adrian Weller.
**Diverse and amortised
counterfactual explanations for uncertainty estimates**.
In *Proceedings of the 36th AAAI Conference on Artificial Intelligence
(AAAI)*, 2022.

** Abstract:** To interpret uncertainty
estimates from differentiable probabilistic models, recent work has proposed
generating a single Counterfactual Latent Uncertainty Explanation (CLUE) for
a given data point where the model is uncertain. We broaden the exploration
to examine δ-CLUE, the set of potential CLUEs within a δ ball of the
original input in latent space. We study the diversity of such sets and find
that many CLUEs are redundant; as such, we propose DIVerse CLUE (∇-CLUE), a
set of CLUEs which each propose a distinct explanation as to how one can
decrease the uncertainty associated with an input. We then further propose
GLobal AMortised CLUE (GLAM-CLUE), a distinct, novel method which learns
amortised mappings that apply to specific groups of uncertain inputs, taking
them and efficiently transforming them in a single function call into inputs
for which a model will be certain. Our experiments show that δ-CLUE,
∇-CLUE, and GLAM-CLUE all address shortcomings of CLUE and provide
beneficial explanations of uncertainty estimates to practitioners.

J. von Kügelgen, A.-H. Karimi, U. Bhatt, I. Valera, A. Weller, and
B. Schölkopf.
**On the fairness of causal
algorithmic recourse**.
In *Proceedings of the 36th AAAI Conference on Artificial Intelligence
(AAAI)*, 2022.

** Abstract:** Algorithmic fairness is
typically studied from the perspective of predictions. Instead, here we
investigate fairness from the perspective of recourse actions suggested to
individuals to remedy an unfavourable classification. We propose two new
fairness criteria at the group and individual level, which - unlike prior
work on equalising the average group-wise distance from the decision boundary
- explicitly account for causal relationships between features, thereby
capturing downstream effects of recourse actions performed in the physical
world. We explore how our criteria relate to others, such as counterfactual
fairness, and show that fairness of recourse is complementary to fairness of
prediction. We study theoretically and empirically how to enforce fair causal
recourse by altering the classifier and perform a case study on the Adult
dataset. Finally, we discuss whether fairness violations in the data
generating process revealed by our criteria may be better addressed by
societal interventions as opposed to constraints on the classifier.

Javier Antorán, Umang Bhatt, Tameem Adel, Adrian Weller, and José Miguel
Hernández-Lobato.
**Getting a CLUE: A
method for explaining uncertainty estimates**.
In *9th International Conference on Learning Representations*, April
2021.

** Abstract:** Both uncertainty estimation and
interpretability are important factors for trustworthy machine learning
systems. However, there is little work at the intersection of these two
areas. We address this gap by proposing a novel method for interpreting
uncertainty estimates from differentiable probabilistic models, like Bayesian
Neural Networks (BNNs). Our method, Counterfactual Latent Uncertainty
Explanations (CLUE), indicates how to change an input, while keeping it on
the data manifold, such that a BNN becomes more confident about the input's
prediction. We validate CLUE through 1) a novel framework for evaluating
counterfactual explanations of uncertainty, 2) a series of ablation
experiments, and 3) a user study. Our experiments show that CLUE outperforms
baselines and enables practitioners to better understand which input patterns
are responsible for predictive uncertainty.

Umang Bhatt, Javier Antorán, Yunfeng Zhang, Q Vera Liao, Prasanna
Sattigeri, Riccardo Fogliato, Gabrielle Melançon, Ranganath Krishnan,
Jason Stanley, Omesh Tickoo, et al.
**Uncertainty as
a form of transparency: Measuring, communicating, and using
uncertainty**.
In *4th AAAI/ACM Conference on Artificial Intelligence, Ethics and
Society*, 2021.

** Abstract:** Algorithmic transparency
entails exposing system properties to various stakeholders for purposes that
include understanding, improving, and contesting predictions. Until now, most
research into algorithmic transparency has predominantly focused on
explainability. Explainability attempts to provide reasons for a machine
learning model's behavior to stakeholders. However, understanding a model's
specific behavior alone might not be enough for stakeholders to gauge whether
the model is wrong or lacks sufficient knowledge to solve the task at hand.
In this paper, we argue for considering a complementary form of transparency
by estimating and communicating the uncertainty associated with model
predictions. First, we discuss methods for assessing uncertainty. Then, we
characterize how uncertainty can be used to mitigate model unfairness,
augment decision-making, and build trustworthy systems. Finally, we outline
methods for displaying uncertainty to stakeholders and recommend how to
collect information required for incorporating uncertainty into existing ML
pipelines. This work constitutes an interdisciplinary review drawn from
literature spanning machine learning, visualization/HCI, design,
decision-making, and fairness. We aim to encourage researchers and
practitioners to measure, communicate, and use uncertainty as a form of
transparency.

Umang Bhatt, Adrian Weller, and Jose M. F. Moura.
**Evaluating and aggregating
feature-based model explanations**.
In *International Joint Conference on Artificial Intelligence*, 2020.

** Abstract:** A feature-based model explanation denotes how
much each input feature contributes to a model's output for a given data
point. As the number of proposed explanation functions grows, we lack
quantitative evaluation criteria to help practitioners know when to use which
explanation function. This paper proposes quantitative evaluation criteria
for feature-based explanations: low sensitivity, high faithfulness, and low
complexity. We devise a framework for aggregating explanation functions. We
develop a procedure for learning an aggregate explanation function with lower
complexity and then derive a new aggregate Shapley value explanation function
that minimizes sensitivity.

Umang Bhatt, Alice Xiang, Shubham Sharma, Adrian Weller, Ankur Taly, Yunhan
Jia, Joydeep Ghosh, Ruchir Puri, José M. F. Moura, and Peter Eckersley.
**Explainable machine learning in
deployment**.
In *ACM Conference on Fairness, Accountability, and Transparency
(FAT\*)*, 2020.

** Abstract:** Explainable machine learning
offers the potential to provide stakeholders with insights into model
behavior by using various methods such as feature importance scores,
counterfactual explanations, or influential training data. Yet there is
little understanding of how organizations use these methods in practice. This
study explores how organizations view and use explainability for stakeholder
consumption. We find that, currently, the majority of deployments are not for
end users affected by the model but rather for machine learning engineers,
who use explainability to debug the model itself. There is thus a gap between
explainability in practice and the goal of transparency, since explanations
primarily serve internal stakeholders rather than external ones. Our study
synthesizes the limitations of current explainability techniques that hamper
their use for end users. To facilitate end user interaction, we develop a
framework for establishing clear goals for explainability. We end by
discussing concerns raised regarding explainability.

Botty Dimanov, Umang Bhatt, Mateja Jamnik, and Adrian Weller.
**You
shouldn't trust me: Learning models which conceal unfairness from multiple
explanation methods**.
In *European Conference on Artificial Intelligence (ECAI)*, 2020.

** Abstract:** Transparency of algorithmic systems has been
discussed as a way for end-users and regulators to develop appropriate trust
in machine learning models. One popular approach, LIME [26], even suggests
that model explanations can answer the question "Why should I trust you?"
Here we show a straightforward method for modifying a pre-trained model to
manipulate the output of many popular feature importance explanation methods
with little change in accuracy, thus demonstrating the danger of trusting
such explanation methods. We show how this explanation attack can mask a
model’s discriminatory use of a sensitive feature, raising strong concerns
about using such explanation methods to check model fairness.

C. Lippert, Z. Ghahramani, and K. Borgwardt.
**Gene function
prediction from synthetic lethality networks via ranking on demand**.
*Bioinformatics*, 26:912-918, 2010.

** Abstract:**
Motivation: Synthetic lethal interactions represent pairs of genes whose
individual mutations are not lethal, while the double mutation of both genes
does incur lethality. Several studies have shown a correlation between
functional similarity of genes and their distances in networks based on
synthetic lethal interactions. However, there is a lack of algorithms for
predicting gene function from synthetic lethality interaction networks.

Results: In this article, we present a novel technique called kernelROD for
gene function prediction from synthetic lethal interaction networks based on
kernel machines. We apply our novel algorithm to Gene Ontology functional
annotation prediction in yeast. Our experiments show that our method leads to
improved gene function prediction compared with state-of-the-art competitors
and that combining genetic and congruence networks leads to a further
improvement in prediction accuracy.

O. Stegle, K. J. Denby, E. J. Cooke, D. L. Wild, Z. Ghahramani, and K. M.
Borgwardt.
**A
robust Bayesian two-sample test for detecting intervals of differential
gene expression in microarray time series**.
*Journal of Computational Biology*, 17(3):1-13, 2010, doi
10.1089/cmb.2009.0175.

** Abstract:** Understanding the
regulatory mechanisms that are responsible for an organism's response to
environmental change is an important issue in molecular biology. A first and
important step towards this goal is to detect genes whose expression levels
are affected by altered external conditions. A range of methods to test for
differential gene expression, both in static as well as in time-course
experiments, have been proposed. While these tests answer the question
*whether* a gene is differentially expressed, they do not explicitly
address the question *when* a gene is differentially expressed, although
this information may provide insights into the course and causal structure of
regulatory programs. In this article, we propose a two-sample test for
identifying intervals of differential gene expression in microarray time
series. Our approach is based on Gaussian process regression, can deal with
arbitrary numbers of replicates, and is robust with respect to outliers. We
apply our algorithm to study the response of *Arabidopsis thaliana*
genes to an infection by a fungal pathogen using a microarray time series
dataset covering 30,336 gene probes at 24 observed time points. In
classification experiments, our test compares favorably with existing methods
and provides additional insights into time-dependent differential
expression.

O. Stegle, K. Denby, S. McHattie, A. Meade, D. Wild, Z. Ghahramani, and
K. Borgwardt.
**Discovering
temporal patterns of differential gene expression in microarray time
series**.
In *German Conference on Bioinformatics*, pages 133-142, Halle,
Germany, September 2009.

** Abstract:** A wealth of time
series of microarray measurements have become available over recent years.
Several two-sample tests for detecting differential gene expression in these
time series have been defined, but they can only answer the question
*whether* a gene is differentially expressed across the whole time
series, not *in which intervals* it is differentially expressed. In this
article, we propose a Gaussian process based approach for studying these
dynamics of differential gene expression. In experiments on *Arabidopsis
thaliana* gene expression levels, our novel technique helps us to uncover
that the family of WRKY transcription factors appears to be involved in the
early response to infection by a fungal pathogen.

C. Lippert, O. Stegle, Z. Ghahramani, and K. Borgwardt.
**A kernel
method for unsupervised structured network inference**.
In D. van Dyk and M. Welling, editors, *12th International Conference on
Artificial Intelligence and Statistics*, volume 5, pages 368-375,
Clearwater Beach, FL, USA, April 2009. Journal of Machine Learning Research.
ISSN: 1938-7228.

** Abstract:** Network inference is the problem
of inferring edges between a set of real-world objects, for instance,
interactions between pairs of proteins in bioinformatics. Current
kernel-based approaches to this problem share a set of common features: (i)
they are supervised and hence require labeled training data; (ii) edges in
the network are treated as mutually independent and hence topological
properties are largely ignored; (iii) they lack a statistical interpretation.
We argue that these common assumptions are often undesirable for network
inference, and propose (i) an unsupervised kernel method (ii) that takes the
global structure of the network into account and (iii) is statistically
motivated. We show that our approach can explain commonly used heuristics in
statistical terms. In experiments on social networks, different variants of
our method demonstrate appealing predictive performance.

Karsten M. Borgwardt and Zoubin Ghahramani.
**Bayesian two-sample
tests**.
*arXiv*, abs/0906.4032, 2009.

** Abstract:** In this
paper, we present two classes of Bayesian approaches to the two-sample
problem. Our first class of methods extends the Bayesian t-test to include
all parametric models in the exponential family and their conjugate priors.
Our second class of methods uses Dirichlet process mixtures (DPM) of such
conjugate-exponential distributions as flexible nonparametric priors over the
unknown distributions.

O. Stegle, K. Denby, David L. Wild, Zoubin Ghahramani, and Karsten Borgwardt.
**A robust
Bayesian two-sample test for detecting intervals of differential gene
expression in microarray time series**.
In *13th Annual International Conference on Research in Computational
Molecular Biology (RECOMB 2009)*, volume 5541 of *Lecture Notes in
Bioinformatics*, pages 201-216, Tucson, AZ, USA, 2009. Springer-Verlag,
doi
10.1007/978-3-642-02008-7_14.

** Abstract:** Understanding
the regulatory mechanisms that are responsible for an organism's response to
environmental changes is an important question in molecular biology. A first
and important step towards this goal is to detect genes whose expression
levels are affected by altered external conditions. A range of methods to
test for differential gene expression, both in static as well as in
time-course experiments, have been proposed. While these tests answer the
question *whether* a gene is differentially expressed, they do not
explicitly address the question *when* a gene is differentially
expressed, although this information may provide insights into the course and
causal structure of regulatory programs. In this article, we propose a
two-sample test for identifying *intervals* of differential gene
expression in microarray time series. Our approach is based on Gaussian
process regression, can deal with arbitrary numbers of replicates and is
robust with respect to outliers. We apply our algorithm to study the response
of *Arabidopsis thaliana* genes to an infection by a fungal pathogen
using a microarray time series dataset covering 30,336 gene probes at 24 time
points. In classification experiments our test compares favorably with
existing methods and provides additional insights into time-dependent
differential expression.

C. Hübler, K. Borgwardt, H.-P. Kriegel, and Z. Ghahramani.
**Metropolis
algorithms for representative subgraph sampling**.
In *Proceedings of 8th IEEE International Conference on Data Mining (ICDM
2008)*, pages 283-292, Pisa, Italy, December 2008. IEEE.
ISSN: 1550-4786.

** Abstract:** While data mining in
chemoinformatics studied graph data with dozens of nodes, systems biology and
the Internet are now generating graph data with thousands and millions of
nodes. Hence data mining faces the algorithmic challenge of coping with this
significant increase in graph size: Classic algorithms for data analysis are
often too expensive and too slow on large graphs.

While one strategy to
overcome this problem is to design novel efficient algorithms, the other is
to 'reduce' the size of the large graph by sampling. This is the scope of
this paper: We will present novel Metropolis algorithms for sampling a
'representative' small subgraph from the original large graph, with
'representative' describing the requirement that the sample shall preserve
crucial graph properties of the original graph. In our experiments, we
improve over the pioneering work of Leskovec and Faloutsos (KDD 2006), by
producing representative subgraph samples that are both smaller and of higher
quality than those produced by other methods from the literature.

Sébastien Bratières, Novi Quadrianto, Sebastian Nowozin, and Zoubin
Ghahramani.
**Scalable
Gaussian Process structured prediction for grid factor graph
applications**.
In *31st International Conference on Machine Learning*, 2014.

** Abstract:** Structured prediction is an important and well
studied problem with many applications across machine learning. GPstruct is a
recently proposed structured prediction model that offers appealing
properties such as being kernelised, non-parametric, and supporting Bayesian
inference (Bratières et al. 2013). The model places a Gaussian process prior
over energy functions which describe relationships between input variables
and structured output variables. However, the memory demand of GPstruct is
quadratic in the number of latent variables and training runtime scales
cubically. This prevents GPstruct from being applied to problems involving
grid factor graphs, which are prevalent in computer vision and spatial
statistics applications. Here we explore a scalable approach to learning
GPstruct models based on ensemble learning, with weak learners (predictors)
trained on subsets of the latent variables and bootstrap data, which can
easily be distributed. We show experiments with 4M latent variables on image
segmentation. Our method outperforms widely-used conditional random field
models trained with pseudo-likelihood. Moreover, in image segmentation
problems it improves over recent state-of-the-art marginal optimisation
methods in terms of predictive performance and uncertainty calibration.
Finally, it generalises well on all training set sizes.

Sébastien Bratières, Novi Quadrianto, and Zoubin Ghahramani.
**Bayesian
structured prediction using Gaussian processes**.
*arXiv*, abs/1307.3846, 2013.

** Abstract:** We introduce
a conceptually novel structured prediction model, GPstruct, which is
kernelized, non-parametric and Bayesian, by design. We motivate the model
with respect to existing approaches, among others, conditional random fields
(CRFs), maximum margin Markov networks (M3N), and structured support vector
machines (SVMstruct), which embody only a subset of its properties. We
present an inference procedure based on Markov Chain Monte Carlo. The
framework can be instantiated for a wide range of structured objects such as
linear chains, trees, grids, and other general graphs. As a proof of concept,
the model is benchmarked on several natural language processing tasks and a
video gesture segmentation task involving a linear chain structure. We show
prediction accuracies for GPstruct which are comparable to or exceeding those
of CRFs and SVMstruct.

Sébastien Bratières, Jurgen Van Gael, Andreas Vlachos, and Zoubin
Ghahramani.
**Scaling the
iHMM: Parallelization versus Hadoop**.
In *Proceedings of the 2010 10th IEEE International Conference on Computer
and Information Technology*, pages 1235-1240, Bradford, UK, 2010. IEEE
Computer Society, doi
10.1109/CIT.2010.223.

** Abstract:** This paper compares
parallel and distributed implementations of an iterative, Gibbs sampling,
machine learning algorithm. Distributed implementations run under Hadoop on
facility computing clouds. The probabilistic model under study is the
infinite HMM (Beal, Ghahramani and Rasmussen, 2002), in which parameters are
learnt using an instance blocked Gibbs
sampling, with a step consisting of a dynamic program. We apply this model to
learn part-of-speech tags from newswire text in an unsupervised fashion.
However, our focus here is on runtime performance, as opposed to NLP-relevant
scores, embodied by iteration duration, ease of development, deployment and
debugging.

Elre T. Oldewage, John Bronskill, and Richard E. Turner.
**Adversarial attacks are a surprisingly strong baseline for poisoning
few-shot meta-learners**.
In *I Can't Believe It's Not Better, Workshop at NeurIPS 2022*, 2022.

** Abstract:** This paper examines the robustness of deployed
few-shot meta-learning systems when they are fed an imperceptibly perturbed
few-shot dataset. We attack amortized meta-learners, which allows us to craft
colluding sets of inputs that are tailored to fool the system's learning
algorithm when used as training data. Jointly crafted adversarial inputs
might be expected to synergistically manipulate a classifier, allowing for
very strong data-poisoning attacks that would be hard to detect. We show that
in a white box setting, these attacks are very successful and can cause the
target model's predictions to become worse than chance. However, in
opposition to the well-known transferability of adversarial examples in
general, the colluding sets do not transfer well to different classifiers. We
explore two hypotheses to explain this: 'overfitting' by the attack, and
mismatch between the model on which the attack is generated and that to which
the attack is transferred. Regardless of the mitigation strategies suggested
by these hypotheses, the colluding inputs transfer no better than adversarial
inputs that are generated independently in the usual way.

John Bronskill, Daniela Massiceti, Massimiliano Patacchiola, Katja Hofmann,
Sebastian Nowozin, and Richard E. Turner.
**Memory
efficient meta-learning with large images**.
In *Advances in Neural Information Processing Systems 34*, 2021.

** Abstract:** Meta learning approaches to few-shot
classification are computationally efficient at test time, requiring just a
few optimization steps or single forward pass to learn a new task, but they
remain highly memory-intensive to train. This limitation arises because a
task's entire support set, which can contain up to 1000 images, must be
processed before an optimization step can be taken. Harnessing the
performance gains offered by large images thus requires either parallelizing
the meta-learner across multiple GPUs, which may not be available, or
trade-offs between task and image size when memory constraints apply. We
improve on both options by proposing LITE, a general and memory efficient
episodic training scheme that enables meta-training on large tasks composed
of large images on a single GPU. We achieve this by observing that the
gradients for a task can be decomposed into a sum of gradients over the
task's training images. This enables us to perform a forward pass on a task's
entire training set but realize significant memory savings by
back-propagating only a random subset of these images which we show is an
unbiased approximation of the full gradient. We use LITE to train
meta-learners and demonstrate new state-of-the-art accuracy on the real-world
ORBIT benchmark and 3 of the 4 parts of the challenging VTAB+MD benchmark
relative to leading meta-learners. LITE also enables meta-learners to be
competitive with transfer learning approaches but at a fraction of the
test-time computational cost, thus serving as a counterpoint to the recent
narrative that transfer learning is all you need for few-shot
classification.
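
The unbiasedness claim in the abstract above can be checked on a toy problem. The following is a minimal sketch, not the LITE implementation: it assumes a linear least-squares loss whose gradient is a sum over examples, and it assumes the subset gradient is rescaled by N/H (the names `per_example_grads`, `subset_gradient`, `N`, `H` are hypothetical). Averaging the rescaled subset gradient over every subset of size H recovers the full gradient.

```python
import itertools
import numpy as np

def per_example_grads(w, X, y):
    """Gradient of 0.5 * (x @ w - y)**2 w.r.t. w, one row per example."""
    residual = X @ w - y            # shape (N,)
    return residual[:, None] * X    # shape (N, D)

def full_gradient(w, X, y):
    """Gradient of the summed loss: the sum of the per-example gradients."""
    return per_example_grads(w, X, y).sum(axis=0)

def subset_gradient(w, X, y, subset):
    """Gradient over a subset only, rescaled by N / H (assumed scaling)."""
    idx = np.asarray(subset)
    scale = X.shape[0] / len(idx)
    return scale * per_example_grads(w, X[idx], y[idx]).sum(axis=0)

rng = np.random.default_rng(0)
N, D, H = 6, 3, 2                   # toy sizes, chosen for illustration
X = rng.normal(size=(N, D))
y = rng.normal(size=N)
w = rng.normal(size=D)

exact = full_gradient(w, X, y)

# Enumerate every size-H subset, so the expectation is computed exactly
# rather than estimated by sampling.
subsets = list(itertools.combinations(range(N), H))
mean_estimate = np.mean(
    [subset_gradient(w, X, y, s) for s in subsets], axis=0
)

print(np.allclose(mean_estimate, exact))
```

Each example appears in a fraction H/N of the subsets, so the N/H rescaling makes the expectation match the full sum; a single subset still gives a noisy estimate.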

Daniela Massiceti, Luisa Zintgraf, John Bronskill, Lida Theodorou,
Matthew Tobias Harris, Edward Cutrell, Cecily Morrison, Katja Hofmann, and
Simone Stumpf.
**ORBIT:
A real-world few-shot dataset for teachable object recognition**.
In *Proceedings of the IEEE/CVF International Conference on Computer
Vision*, pages 10818-10828, 2021.

** Abstract:** Object
recognition has made great advances in the last decade, but predominately
still relies on many high-quality training examples per object category. In
contrast, learning new objects from only a few examples could enable many
impactful applications from robotics to user personalization. Most few-shot
learning research, however, has been driven by benchmark datasets that lack
the high variation that these applications will face when deployed in the
real world. To close this gap, we present the ORBIT dataset and benchmark,
grounded in the real-world application of teachable object recognizers for
people who are blind/low-vision. The dataset contains 3,822 videos of 486
objects recorded by people who are blind/low-vision on their mobile phones.
The benchmark reflects a realistic, highly challenging recognition problem,
providing a rich playground to drive research in robustness to few-shot,
high-variation conditions. We set the benchmark's first state-of-the-art and
show there is massive scope for further innovation, holding the potential to
impact a broad range of real-world vision applications including tools for
the blind/low-vision community.

Elre T. Oldewage, John Bronskill, and Richard E. Turner.
**Attacking few-shot
classifiers with adversarial support poisoning**.
In *A Blessing in Disguise: The Prospects and Perils of Adversarial Machine
Learning, Workshop at ICML 2021*, 2021.

** Abstract:**
This paper examines the robustness of deployed few-shot meta-learning systems
when they are fed an imperceptibly perturbed few-shot dataset, showing that
the resulting predictions on test inputs can become worse than chance. This
is achieved by developing a novel attack, Adversarial Support Poisoning or
ASP, which crafts a poisoned set of examples. When even a small subset of
malicious data points is inserted into the support set of a meta-learner,
accuracy is significantly reduced. We evaluate the new attack on a variety of
few-shot classification algorithms and scenarios, and propose a form of
adversarial training that significantly improves robustness against both
poisoning and evasion attacks.

Megan Stanley, John Bronskill, Krzysztof Maziarz, Hubert Misztela, Jessica
Lanini, Marwin Segler, Nadine Schneider, and Marc Brockschmidt.
**FS-Mol: A few-shot
learning dataset of molecules**.
In *Thirty-fifth Conference on Neural Information Processing Systems
Datasets and Benchmarks Track (Round 2)*, 2021.

** Abstract:** Small datasets are ubiquitous in drug discovery as data
generation is expensive and can be restricted for ethical reasons (e.g., in vivo
experiments). A widely applied technique in early drug discovery to identify
novel active molecules against a protein target is modelling quantitative
structure-activity relationships (QSAR). It is known to be extremely
challenging, as available measurements of compound activities range in the
low dozens or hundreds. However, many such related datasets exist, each with
a small number of datapoints, opening up the opportunity for few-shot
learning after pre-training on a substantially larger corpus of data. At the
same time, many few-shot learning methods are currently evaluated in the
computer-vision domain. We propose that expansion into a new application, as
well as the possibility to use explicitly graph-structured data, will drive
exciting progress in few-shot learning. Here, we provide a few-shot learning
dataset (FS-Mol) and complementary benchmarking procedure. We define a set of
tasks on which few-shot learning methods can be evaluated, with a separate
set of tasks for use in pre-training. In addition, we implement and evaluate
a number of existing single-task, multi-task, and meta-learning approaches as
baselines for the community. We hope that our dataset, support code release,
and baselines will encourage future work on this extremely challenging new
domain for few-shot learning.

John Bronskill.
**Data
and computation efficient meta-learning**.
PhD thesis, University of Cambridge, Cambridge, UK, November 2020.

** Abstract:** In order to make predictions with high accuracy,
conventional deep learning systems require large training datasets consisting
of thousands or millions of examples and long training times measured in
hours or days, consuming high levels of electricity with a negative impact on
our environment. It is desirable to have machine learning systems that
can emulate human behavior such that they can quickly learn new concepts from
only a few examples. This is especially true if we need to quickly customize
or personalize machine learning models to specific scenarios where it would
be impractical to acquire a large amount of training data and where a mobile
device is the means for computation. We define a data efficient machine
learning system to be one that can learn a new concept from only a few
examples (or shots) and a computation efficient machine learning system to be
one that can learn a new concept rapidly without retraining on an everyday
computing device such as a smart phone. In this work, we design, develop,
analyze, and extend the theory of machine learning systems that are both data
efficient and computation efficient. We present systems that are trained
using multiple tasks such that they "learn how to learn" to solve new tasks
from only a few examples. These systems can efficiently solve new, unseen
tasks drawn from a broad range of data distributions, in both the low and
high data regimes, without the need for costly retraining. Adapting to a new
task requires only a forward pass of the example task data through the
trained network making the learning of new tasks possible on mobile devices.
In particular, we focus on few-shot image classification systems, i.e.
machine learning systems that can distinguish between numerous classes of
objects depicted in digital images given only a few examples of each class of
object to learn from.

John Bronskill, Jonathan Gordon, James Requeima, Sebastian Nowozin, and
Richard E. Turner.
**TaskNorm:
rethinking batch normalization for meta-learning**.
In *37th International Conference on Machine Learning*. Proceedings of
Machine Learning Research, 2020.

** Abstract:** Modern
meta-learning approaches for image classification rely on increasingly deep
networks to achieve state-of-the-art performance, making batch normalization
an essential component of meta-learning pipelines. However, the hierarchical
nature of the meta-learning setting presents several challenges that can
render conventional batch normalization ineffective, giving rise to the need
to rethink normalization in this setting. We evaluate a range of approaches
to batch normalization for meta-learning scenarios, and develop a novel
approach that we call TASKNORM. Experiments on fourteen datasets demonstrate
that the choice of batch normalization has a dramatic effect on both
classification accuracy and training time for both gradient based and
gradient free meta-learning approaches. Importantly, TASKNORM is found to
consistently improve performance. Finally, we provide a set of best practices
for normalization that will allow fair comparison of meta-learning
algorithms.

Jonathan Gordon, John Bronskill, Matthias Bauer, Sebastian Nowozin, and
Richard Turner.
**Meta-learning
probabilistic inference for prediction**.
In *7th International Conference on Learning Representations*, New
Orleans, April 2019.

** Abstract:** This paper introduces a
new framework for data efficient and versatile learning. Specifically: 1) We
develop ML-PIP, a general framework for Meta-Learning approximate
Probabilistic Inference for Prediction. ML-PIP extends existing probabilistic
interpretations of meta-learning to cover a broad class of methods. 2) We
introduce Versa, an instance of the framework employing a flexible and
versatile amortization network that takes few-shot learning datasets as
inputs, with arbitrary numbers of shots, and outputs a distribution over
task-specific parameters in a single forward pass. Versa substitutes
optimization at test time with forward passes through inference networks,
amortizing the cost of inference and relieving the need for second
derivatives during training. 3) We evaluate Versa on benchmark datasets
where the method sets new state-of-the-art results, and can handle arbitrary
numbers of shots, and for classification, arbitrary numbers of classes at
train and test time. The power of the approach is then demonstrated through a
challenging few-shot ShapeNet view reconstruction task.

James Requeima, Jonathan Gordon, John Bronskill, Sebastian Nowozin, and
Richard E. Turner.
**Fast
and flexible multi-task classification using conditional neural adaptive
processes**.
In *Advances in Neural Information Processing Systems 33*, 2019.

** Abstract:** The goal of this paper is to design image
classification systems that, after an initial multi-task training phase, can
automatically adapt to new tasks encountered at test time. We introduce a
conditional neural process based approach to the multi-task classification
setting for this purpose, and establish connections to the meta- and few-shot
learning literature. The resulting approach, called CNAPs, comprises a
classifier whose parameters are modulated by an adaptation network that takes
the current task's dataset as input. We demonstrate that CNAPs achieves
state-of-the-art results on the challenging Meta-Dataset benchmark indicating
high-quality transfer-learning. We show that the approach is robust, avoiding
both over-fitting in low-shot regimes and under-fitting in high-shot regimes.
Timing experiments reveal that CNAPs is computationally efficient at
test-time as it does not involve gradient based adaptation. Finally, we show
that trained models are immediately deployable to continual learning and
active learning where they can outperform existing approaches that do not
leverage transfer learning.

Wessel P. Bruinsma, Martin Tegnér, and Richard E. Turner.
**Modelling
non-smooth signals with complex spectral structure**.
In *25th International Conference on Artificial Intelligence and
Statistics*, 2022.

** Abstract:** The Gaussian Process
Convolution Model (GPCM; Tobar et al., 2015a) is a model for signals with
complex spectral structure. A significant limitation of the GPCM is that it
assumes a rapidly decaying spectrum: it can only model smooth signals.
Moreover, inference in the GPCM currently requires (1) a mean-field
assumption, resulting in poorly calibrated uncertainties, and (2) a tedious
variational optimisation of large covariance matrices. We redesign the GPCM
model to induce a richer distribution over the spectrum with relaxed
assumptions about smoothness: the Causal Gaussian Process Convolution Model
(CGPCM) introduces a causality assumption into the GPCM, and the Rough
Gaussian Process Convolution Model (RGPCM) can be interpreted as a Bayesian
nonparametric generalisation of the fractional Ornstein–Uhlenbeck process.
We also propose a more effective variational inference scheme, going beyond
the mean-field assumption: we design a Gibbs sampler which directly samples
from the optimal variational solution, circumventing any variational
optimisation entirely. The proposed variations of the GPCM are validated in
experiments on synthetic and real-world data, showing promising results.

Beau Coker, Wessel P. Bruinsma, David R. Burt, Weiwei Pan, and Finale
Doshi-Velez.
**Wide
mean-field Bayesian neural networks ignore the data**.
In *25th International Conference on Artificial Intelligence and
Statistics*, 2022.

** Abstract:** Bayesian neural networks
(BNNs) combine the expressive power of deep learning with the advantages of
Bayesian formalism. In recent years, the analysis of wide, deep BNNs has
provided theoretical insight into their priors and posteriors. However, we
have no analogous insight into their posteriors under approximate inference.
In this work, we show that mean-field variational inference *entirely
fails to model the data* when the network width is large and the
activation function is odd. Specifically, for fully-connected BNNs with odd
activation functions and a homoscedastic Gaussian likelihood, we show that
the *optimal* mean-field variational posterior predictive (i.e.,
function space) distribution converges to the prior predictive distribution
as the width tends to infinity. We generalize aspects of this result to other
likelihoods. Our theoretical results are suggestive of underfitting behavior
previously observed in BNNs. While our convergence bounds are
non-asymptotic and constants in our analysis can be computed, they are
currently too loose to be applicable in standard training regimes. Finally,
we show that the optimal approximate posterior need not tend to the prior if
the activation function is not odd, showing that our statements cannot be
generalized arbitrarily.

Vidhi Lalchand, Wessel P. Bruinsma, David R. Burt, and Carl E. Rasmussen.
**Sparse Gaussian
process hyperparameters: Optimize or integrate?**.
In *Advances in Neural Information Processing Systems 36*, 2022.

** Abstract:** The kernel function and
its hyperparameters are the central model selection choice in a Gaussian
process [Rasmussen and Williams, 2006]. Typically, the hyperparameters of the
kernel are chosen by maximising the marginal likelihood, an approach known as
Type-II maximum likelihood (ML-II). However, ML-II does not account for
hyperparameter uncertainty, and it is well-known that this can lead to
severely biased estimates and an underestimation of predictive uncertainty.
While there are several works which employ a fully Bayesian characterisation
of GPs, relatively few propose such approaches for the sparse GPs paradigm.
In this work we propose an algorithm for sparse Gaussian process regression
which leverages MCMC to sample from the hyperparameter posterior within the
variational inducing point framework of [Titsias, 2009]. This work is closely
related to Hensman et al. [2015b], but side-steps the need to sample the
inducing points, thereby significantly improving sampling efficiency in the
Gaussian likelihood case. We compare this scheme against natural baselines in
the literature and stochastic variational GPs (SVGPs), and provide an
extensive computational analysis.

Stratis Markou, James Requeima, Wessel P. Bruinsma, Anna Vaughan, and
Richard E. Turner.
**Practical conditional
neural processes via tractable dependent predictions**.
In *10th International Conference on Learning Representations*, 2022.

** Abstract:** Conditional Neural Processes (CNPs; Garnelo et
al., 2018) are meta-learning models which leverage the flexibility of deep
learning to produce well-calibrated predictions and naturally handle
off-the-grid and missing data. CNPs scale to large datasets and train with
ease. Due to these features, CNPs appear well-suited to tasks from
environmental sciences or healthcare. Unfortunately, CNPs do not produce
correlated predictions, making them fundamentally inappropriate for many
estimation and decision making tasks. Predicting heat waves or floods, for
example, requires modelling dependencies in temperature or precipitation over
time and space. Existing approaches which model output dependencies, such as
Neural Processes (NPs; Garnelo et al., 2018b) or the FullConvGNP (Bruinsma et
al., 2021), are either complicated to train or prohibitively expensive. What
is needed is an approach which provides dependent predictions, but is simple
to train and computationally tractable. In this work, we present a new class
of Neural Process models that make correlated predictions and support exact
maximum likelihood training that is simple and scalable. We extend the
proposed models by using invertible output transformations, to capture
non-Gaussian output distributions. Our models can be used in downstream
estimation tasks which require dependent function samples. By accounting for
output dependencies, our models show improved predictive performance on a
range of experiments with synthetic and real data.

Ambrish Rawat, James Requeima, Wessel Bruinsma, and Richard Turner.
**Challenges and pitfalls of
Bayesian unlearning**.
In *ICML 2022 Workshop on Updatable Machine Learning (UpML)*, 2022.

** Abstract:** Machine unlearning refers to the task of removing
a subset of training data, thereby removing its contributions to a trained
model. Approximate unlearning is one class of methods for this task which
avoids the need to retrain the model from scratch on the retained data.
Bayes’ rule can be used to cast approximate unlearning as an inference
problem where the objective is to obtain the updated posterior by dividing
out the likelihood of deleted data. However, this has its own set of
challenges as one often doesn’t have access to the exact posterior of the
model parameters. In this work we examine the use of the Laplace
approximation and Variational Inference to obtain the updated posterior. With
a neural network trained for a regression task as the guiding example, we
draw insights on the applicability of Bayesian unlearning in practical
scenarios.
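
The "dividing out" step mentioned in the abstract above can be written explicitly. This is a sketch in assumed notation ($\theta$ for the model parameters, $D$ for the full training set, $D_{\text{del}}$ for the deleted subset; none of these symbols appear in the abstract), and it assumes the data points are conditionally independent given $\theta$:

```latex
% Full-data posterior factorises over retained and deleted data:
%   p(\theta \mid D) \propto p(\theta)\, p(D \setminus D_{\text{del}} \mid \theta)\, p(D_{\text{del}} \mid \theta)
% Dividing out the likelihood of the deleted data gives the updated posterior:
\[
  p(\theta \mid D \setminus D_{\text{del}})
  \;\propto\;
  \frac{p(\theta \mid D)}{p(D_{\text{del}} \mid \theta)} .
\]
```

In practice $p(\theta \mid D)$ is replaced by an approximation such as the Laplace or variational posterior, which is where the difficulties examined in the paper arise.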

Wessel P. Bruinsma, James Requeima, Andrew Y. K. Foong, Jonathan Gordon, and
Richard E. Turner.
**The Gaussian neural
process**.
In *3rd Symposium on Advances in Approximate Bayesian Inference*,
2021.

** Abstract:** Neural Processes (NPs; Garnelo et al.,
2018a,b) are a rich class of models for meta-learning that map data sets
directly to predictive stochastic processes. We provide a rigorous analysis
of the standard maximum-likelihood objective used to train conditional NPs.
Moreover, we propose a new member to the Neural Process family called the
Gaussian Neural Process (GNP), which models predictive correlations,
incorporates translation equivariance, provides universal approximation
guarantees, and demonstrates encouraging performance.

Andrew Y. K. Foong, Wessel P. Bruinsma, David R. Burt, and Richard E. Turner.
**How
tight can PAC-Bayes be in the small data regime?**.
In *Advances in Neural Information Processing Systems 34*. Curran
Associates, Inc., 2021.

** Abstract:** In this paper, we
investigate the question: Given a small number of datapoints, for example N =
30, how tight can PAC-Bayes and test set bounds be made? For such small
datasets, test set bounds adversely affect generalisation performance by
withholding data from the training procedure. In this setting, PAC-Bayes
bounds are especially attractive, due to their ability to use all the data to
simultaneously learn a posterior and bound its generalisation risk. We focus
on the case of i.i.d. data with a bounded loss and consider the generic
PAC-Bayes theorem of Germain et al. While their theorem is known to recover
many existing PAC-Bayes bounds, it is unclear what the tightest bound
derivable from their framework is. For a fixed learning algorithm and
dataset, we show that the tightest possible bound coincides with a bound
considered by Catoni; and, in the more natural case of distributions over
datasets, we establish a lower bound on the best bound achievable in
expectation. Interestingly, this lower bound recovers the Chernoff test set
bound if the posterior is equal to the prior. Moreover, to illustrate how
tight these bounds can be, we study synthetic one-dimensional classification
tasks in which it is feasible to meta-learn both the prior and the form of
the bound to numerically optimise for the tightest bounds possible. We find
that in this simple, controlled scenario, PAC-Bayes bounds are competitive
with comparable, commonly used Chernoff test set bounds. However, the
sharpest test set bounds still lead to better guarantees on the
generalisation error than the PAC-Bayes bounds we consider.

Jonathan Gordon, Wessel Bruinsma, Andrew Y. K. Foong, James Requeima, Yann
Dubois, and Richard Turner.
**Convolutional
conditional neural processes**.
In *8th International Conference on Learning Representations*, Addis
Ababa, April 2020.

** Abstract:** We introduce the
Convolutional Conditional Neural Process (ConvCNP), a new member of the
Neural Process family that models translation equivariance in the data.
Translation equivariance is an important inductive bias for many learning
problems including time series modelling, spatial data, and images. The model
embeds data sets into an infinite-dimensional function space, as opposed to
finite-dimensional vector spaces. To formalize this notion, we extend the
theory of neural representations of sets to include functional
representations, and demonstrate that any translation-equivariant embedding
can be represented using a convolutional deep-set. We evaluate ConvCNPs in
several settings, demonstrating that they achieve state-of-the-art
performance compared to existing NPs. We demonstrate that building in
translation equivariance enables zero-shot generalization to challenging,
out-of-domain tasks.

Wessel Bruinsma, Eric Perim, Will Tebbutt, J. Scott Hosking, Arno Solin, and
Richard E. Turner.
**Scalable
exact inference in multi-output Gaussian processes**.
In *37th International Conference on Machine Learning*. Proceedings of
Machine Learning Research, 2020.

** Abstract:** Multi-output
Gaussian processes (MOGPs) leverage the flexibility and interpretability of
GPs while capturing structure across outputs, which is desirable, for
example, in spatio-temporal modelling. The key problem with MOGPs is their
computational scaling $O(n^3 p^3)$, which is cubic in the number of both
inputs $n$ (e.g., time points or locations) and outputs $p$. For this reason,
a popular class of MOGPs assumes that the data live around a low-dimensional
linear subspace, reducing the complexity to $O(n^3 m^3)$. However, this cost
is still cubic in the dimensionality of the subspace $m$, which is still
prohibitively expensive for many applications. We propose the use of a
sufficient statistic of the data to accelerate inference and learning in
MOGPs with orthogonal bases. The method achieves linear scaling in $m$ in
practice, allowing these models to scale to large $m$ without sacrificing
significant expressivity or requiring approximation. This advance opens up a
wide range of real-world tasks and can be combined with existing GP
approximations in a plug-and-play way. We demonstrate the efficacy of the
method on various synthetic and real-world data sets.

Andrew Y. K. Foong, Wessel P. Bruinsma, Jonathan Gordon, Yann Dubois, James
Requeima, and Richard E. Turner.
**Meta-learning
stationary stochastic process prediction with convolutional neural
processes**.
In *Advances in Neural Information Processing Systems 33*. Curran
Associates, Inc., 2020.

** Abstract:** Stationary stochastic
processes (SPs) are a key component of many probabilistic models, such as
those for off-the-grid spatio-temporal data. They enable the statistical
symmetry of underlying physical phenomena to be leveraged, thereby aiding
generalization. Prediction in such models can be viewed as a translation
equivariant map from observed data sets to predictive SPs, emphasizing the
intimate relationship between stationarity and equivariance. Building on
this, we propose the Convolutional Neural Process (ConvNP), which endows
Neural Processes (NPs) with translation equivariance and extends
convolutional conditional NPs to allow for dependencies in the predictive
distribution. The latter enables ConvNPs to be deployed in settings which
require coherent samples, such as Thompson sampling or conditional image
completion. Moreover, we propose a new maximum-likelihood objective to
replace the standard ELBO objective in NPs, which conceptually simplifies the
framework and empirically improves performance. We demonstrate the strong
performance and generalization capabilities of ConvNPs on 1D regression,
image completion, and various tasks with real-world spatio-temporal data.

James Requeima, William Tebbutt, Wessel Bruinsma, and Richard E. Turner.
**The Gaussian
process autoregressive regression model (GPAR)**.
In *22nd International Conference on Artificial Intelligence and
Statistics*. Proceedings of Machine Learning Research, 2019.

** Abstract:** Multi-output regression models must exploit
dependencies between outputs to maximise predictive performance. The
application of Gaussian processes (GPs) to this setting typically yields
models that are computationally demanding and have limited representational
power. We present the Gaussian Process Autoregressive Regression (GPAR)
model, a scalable multi-output GP model that is able to capture nonlinear,
possibly input-varying, dependencies between outputs in a simple and
tractable way: the product rule is used to decompose the joint distribution
over the outputs into a set of conditionals, each of which is modelled by a
standard GP. GPAR’s efficacy is demonstrated on a variety of synthetic and
real-world problems, outperforming existing GP models and achieving
state-of-the-art performance on established benchmarks.
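
The product-rule decomposition described in the abstract above can be written as follows. The notation is assumed for illustration ($x$ for the input, $y_1, \dots, y_M$ for the $M$ outputs in a chosen ordering) and is not taken from the abstract:

```latex
\[
  p(y_1, \dots, y_M \mid x)
  \;=\;
  \prod_{m=1}^{M} p\bigl(y_m \mid y_1, \dots, y_{m-1}, x\bigr),
\]
% Each conditional on the right-hand side is modelled by a standard GP
% whose inputs are x together with the preceding outputs y_1, ..., y_{m-1}.
```

The factorisation itself is exact for any ordering of the outputs; the modelling choice lies in representing each conditional with a standard GP.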

Matthew Ashman, Thang D. Bui, Cuong V. Nguyen, Efstratios Markou, Adrian
Weller, Siddharth Swaroop, and Richard E. Turner.
**Partitioned variational inference:
A framework for probabilistic federated learning**.
2022.

** Abstract:** The proliferation of computing devices has
brought about an opportunity to deploy machine learning models on new problem
domains using previously inaccessible data. Traditional algorithms for
training such models often require data to be stored on a single machine with
compute performed by a single node, making them unsuitable for decentralised
training on multiple devices. This deficiency has motivated the development
of federated learning algorithms, which allow multiple data owners to train
collaboratively and use a shared model whilst keeping local data private.
However, many of these algorithms focus on obtaining point estimates of model
parameters, rather than probabilistic estimates capable of capturing model
uncertainty, which is essential in many applications. Variational inference
(VI) has become the method of choice for fitting many modern probabilistic
models. In this paper we introduce partitioned variational inference (PVI), a
general framework for performing VI in the federated setting. We develop new
supporting theory for PVI, demonstrating a number of properties that make it
an attractive choice for practitioners; use PVI to unify a wealth of
fragmented, yet related literature; and provide empirical results that
showcase the effectiveness of PVI in a variety of federated settings.

Cuong V. Nguyen, Yingzhen Li, Thang D. Bui, and Richard E. Turner.
**Variational
Continual Learning**.
In *Sixth International Conference on Learning Representations*,
Vancouver, Canada, May 2018.

** Abstract:** This paper develops
variational continual learning (VCL), a simple but general framework for
continual learning that fuses online variational inference (VI) and recent
advances in Monte Carlo VI for neural networks. The framework can
successfully train both deep discriminative models and deep generative models
in complex continual learning settings where existing tasks evolve over time
and entirely new tasks emerge. Experimental results show that variational
continual learning outperforms state-of-the-art continual learning methods on
a variety of tasks, avoiding catastrophic forgetting in a fully automatic
way.

Thang D. Bui, Cuong V. Nguyen, and Richard E. Turner.
**Streaming
sparse Gaussian process approximations**.
In *Advances in Neural Information Processing Systems 31*,
volume 31, Long Beach, California, USA, December 2017.

** Abstract:** Sparse approximations for Gaussian process models provide a
suite of methods that enable these models to be deployed in the large data regime
and enable analytic intractabilities to be sidestepped. However, the field
lacks a principled method to handle streaming data in which the posterior
distribution over function values and the hyperparameters are updated in an
online fashion. The small number of existing approaches either use suboptimal
hand-crafted heuristics for hyperparameter learning, or suffer from
catastrophic forgetting or slow updating when new data arrive. This paper
develops a new principled framework for deploying Gaussian process
probabilistic models in the streaming setting, providing principled methods
for learning hyperparameters and optimising pseudo-input locations. The
proposed framework is experimentally validated using synthetic and real-world
datasets.

** Comment:** The first two authors contributed equally.

Thang D. Bui, Josiah Yan, and Richard E. Turner.
**A unifying framework for
Gaussian process pseudo-point approximations using power expectation
propagation**.
*Journal of Machine Learning Research*, 18(104):1-72, 2017.

** Abstract:** Gaussian processes (GPs) are flexible
distributions over functions that enable high-level assumptions about unknown
functions to be encoded in a parsimonious, flexible and general way. Although
elegant, the application of GPs is limited by computational and analytical
intractabilities that arise when data are sufficiently numerous or when
employing non-Gaussian models. Consequently, a wealth of GP approximation
schemes have been developed over the last 15 years to address these key
limitations. Many of these schemes employ a small set of pseudo data points
to summarise the actual data. In this paper we develop a new pseudo-point
approximation framework using Power Expectation Propagation (Power EP) that
unifies a large number of these pseudo-point approximations. Unlike much of
the previous venerable work in this area, the new framework is built on
standard methods for approximate inference (variational free-energy, EP and
Power EP methods) rather than employing approximations to the probabilistic
generative model itself. In this way all of the approximation is performed at
'inference time' rather than at 'modelling time', resolving awkward
philosophical and empirical questions that trouble previous approaches.
Crucially, we demonstrate that the new framework includes new pseudo-point
approximation methods that outperform current approaches on regression and
classification tasks.

Thang D. Bui, Daniel Hernández-Lobato, José Miguel Hernández-Lobato,
Yingzhen Li, and Richard E. Turner.
**Deep Gaussian
processes for regression using approximate expectation propagation**.
In *33rd International Conference on Machine Learning*, New York, USA,
June 2016.

** Abstract:** Deep Gaussian processes (DGPs) are
multi-layer hierarchical generalisations of Gaussian processes (GPs) and are
formally equivalent to neural networks with multiple, infinitely wide hidden
layers. DGPs are nonparametric probabilistic models and as such are arguably
more flexible, have a greater capacity to generalise, and provide better
calibrated uncertainty estimates than alternative deep models. This paper
develops a new approximate Bayesian learning scheme that enables DGPs to be
applied to a range of medium to large scale regression problems for the first
time. The new method uses an approximate Expectation Propagation procedure
and a novel and efficient extension of the probabilistic backpropagation
algorithm for learning. We evaluate the new method for non-linear regression
on eleven real-world datasets, showing that it always outperforms GP
regression and is almost always better than state-of-the-art deterministic
and sampling-based approximate inference methods for Bayesian neural
networks. As a by-product, this work provides a comprehensive analysis of six
approximate Bayesian methods for training neural networks.

José Miguel Hernández-Lobato, Yingzhen Li, Mark Rowland, Thang D. Bui,
Daniel Hernández-Lobato, and Richard E. Turner.
**Black-box
Alpha Divergence Minimization**.
In *33rd International Conference on Machine Learning*, New York, USA,
June 2016.

** Abstract:** Black-box alpha (BB-$α$) is a
new approximate inference method based on the minimization of
$α$-divergences. BB-$α$ scales to large datasets because it can be
implemented using stochastic gradient descent. BB-$α$ can be applied to
complex probabilistic models with little effort since it only requires as
input the likelihood function and its gradients. These gradients can be
easily obtained using automatic differentiation. By changing the divergence
parameter $α$, the method is able to interpolate between variational Bayes
(VB) ($α \rightarrow 0$) and an algorithm similar to expectation
propagation (EP) ($α = 1$). Experiments on probit regression and neural
network regression and classification problems show that BB-$α$ with
non-standard settings of $α$, such as $α = 0.5$, usually produces
better predictions than with $α \rightarrow 0$ (VB) or $α = 1$
(EP).
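
As a small numerical illustration of the interpolation described above, the sketch below evaluates an $α$-divergence between two discrete distributions and compares its limits with the two KL divergences. It assumes the Amari form of the divergence, $D_α[p\|q] = (1 - \sum_i p_i^{α} q_i^{1-α}) / (α(1-α))$; the distributions `p` and `q` are made-up stand-ins, and nothing here reproduces the BB-$α$ algorithm itself.

```python
import math

def alpha_divergence(p, q, alpha):
    """Amari alpha-divergence between discrete distributions p and q."""
    if alpha in (0.0, 1.0):
        raise ValueError("use the KL limits for alpha = 0 or alpha = 1")
    overlap = sum(pi ** alpha * qi ** (1.0 - alpha) for pi, qi in zip(p, q))
    return (1.0 - overlap) / (alpha * (1.0 - alpha))

def kl(a, b):
    """Kullback-Leibler divergence KL[a || b] for discrete distributions."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0.0)

p = [0.7, 0.2, 0.1]  # hypothetical stand-in for the exact posterior
q = [0.5, 0.3, 0.2]  # hypothetical stand-in for the approximation

for alpha in (1e-4, 0.5, 1.0 - 1e-4):
    print(f"alpha={alpha:.4f}  D_alpha={alpha_divergence(p, q, alpha):.5f}")
print(f"KL[q||p]={kl(q, p):.5f}  KL[p||q]={kl(p, q):.5f}")
```

Under this definition the value near $α = 0$ approaches KL[q||p], the objective minimised by VB, and the value near $α = 1$ approaches KL[p||q], the divergence associated with EP.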

Felipe Tobar, Thang D. Bui, and Richard E. Turner.
**Learning
stationary time series using Gaussian processes with nonparametric
kernels**.
In *Advances in Neural Information Processing Systems 29*, Montréal,
Canada, Dec 2015.

** Abstract:** We introduce the Gaussian
Process Convolution Model (GPCM), a two-stage nonparametric generative
procedure to model stationary signals as the convolution between a
continuous-time white-noise process and a continuous-time linear filter drawn
from a Gaussian process. The GPCM is a continuous-time nonparametric-window
moving average process and, conditionally, is itself a Gaussian process with
a nonparametric kernel defined in a probabilistic fashion. The generative
model can be equivalently considered in the frequency domain, where the power
spectral density of the signal is specified using a Gaussian process. One of
the main contributions of the paper is to develop a novel variational
free-energy approach based on inter-domain inducing variables that efficiently
learns the continuous-time linear filter and infers the driving white-noise
process. In turn, this scheme provides closed-form probabilistic estimates of
the covariance kernel and the noise-free signal both in denoising and
prediction scenarios. Additionally, the variational inference procedure
provides closed-form expressions for the approximate posterior of the
spectral density given the observed data, leading to new Bayesian
nonparametric approaches to spectrum estimation. The proposed GPCM is
validated using synthetic and real-world signals.

Thang D. Bui and Richard E. Turner.
**Tree-structured Gaussian process approximations**.
In Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and K.Q. Weinberger,
editors, *Advances in Neural Information Processing Systems 28*,
volume 28, pages 2213-2221. Curran Associates, Inc., 2014.

** Abstract:** Gaussian process regression can be accelerated by
constructing a small pseudo-dataset to summarize the observed data. This idea
sits at the heart of many approximation schemes, but such an approach
requires the number of pseudo-datapoints to be scaled with the range of the
input space if the accuracy of the approximation is to be maintained. This
presents problems in time-series settings or in spatial datasets where large
numbers of pseudo-datapoints are required since computation typically scales
quadratically with the pseudo-dataset size. In this paper we devise an
approximation whose complexity grows linearly with the number of
pseudo-datapoints. This is achieved by imposing a tree or chain structure on
the pseudo-datapoints and calibrating the approximation using a
Kullback-Leibler (KL) minimization. Inference and learning can then be
performed efficiently using the Gaussian belief propagation algorithm. We
demonstrate the validity of our approach on a set of challenging regression
tasks including missing data imputation for audio and spatial datasets. We
trace out the speed-accuracy trade-off for the new method and show that the
frontier dominates those obtained from a large number of existing
approximation techniques.

David R. Burt.
**Scalable
Approximate Inference and Model Selection in Gaussian Process
Regression**.
PhD thesis, University of Cambridge, Department of Engineering, Cambridge, UK,
2022.

** Abstract:** Models with Gaussian process priors and
Gaussian likelihoods are one of only a handful of Bayesian models where
inference can be performed without the need for approximation. However, a
frequent criticism of these models from practitioners of Bayesian machine
learning is that they are challenging to scale to large datasets due to the
need to compute a large kernel matrix and perform standard linear-algebraic
operations with this matrix. This limitation has driven decades of research
in both statistics and machine learning seeking to scale Gaussian process
regression models to ever-larger datasets. This thesis builds on this line of
research. We focus on the problem of approximate inference and model
selection with approximate maximum marginal likelihood as applied to Gaussian
process regression. Our discussion is guided by three questions: Does an
approximation work on a range of models and datasets? Can you verify that an
approximation has worked on a given dataset? Is an approximation easy for a
practitioner to use? While we are far from the first to ask these questions,
we offer new insights into each question in the context of Gaussian process
regression. In the first part of this thesis, we focus on sparse variational
Gaussian process regression (Titsias, 2009). We provide new diagnostics for
inference with this method that can be used as practical guides for
practitioners trying to balance computation and accuracy with this
approximation. We then provide an asymptotic analysis that highlights
properties of the model and dataset that are sufficient for this
approximation to perform reliable inference with a small computational cost.
This analysis builds on an approach laid out in Burt (2018), as well as on
similar guarantees in the kernel ridge regression literature. In the second
part of this thesis, we consider iterative methods, especially the method of
conjugate gradients, as applied to Gaussian process regression (Gibbs and
MacKay, 1997). We primarily focus on improving the reliability of approximate
maximum marginal likelihood when using these approximations. We investigate
how the method of conjugate gradients and related approaches can be used to
derive bounds on quantities related to the log marginal likelihood. This idea
can be used to improve the speed and stability of model selection with these
approaches, making them easier to use in practice.

Beau Coker, Wessel P. Bruinsma, David R. Burt, Weiwei Pan, and Finale
Doshi-Velez.
**Wide
mean-field Bayesian neural networks ignore the data**.
In *25th International Conference on Artificial Intelligence and
Statistics*, 2022.

** Abstract:** Bayesian neural networks
(BNNs) combine the expressive power of deep learning with the advantages of
Bayesian formalism. In recent years, the analysis of wide, deep BNNs has
provided theoretical insight into their priors and posteriors. However, we
have no analogous insight into their posteriors under approximate inference.
In this work, we show that mean-field variational inference *entirely
fails to model the data* when the network width is large and the
activation function is odd. Specifically, for fully-connected BNNs with odd
activation functions and a homoscedastic Gaussian likelihood, we show that
the *optimal* mean-field variational posterior predictive (i.e.,
function space) distribution converges to the prior predictive distribution
as the width tends to infinity. We generalize aspects of this result to other
likelihoods. Our theoretical results are suggestive of underfitting behavior
previously observed in BNNs. While our convergence bounds are
non-asymptotic and constants in our analysis can be computed, they are
currently too loose to be applicable in standard training regimes. Finally,
we show that the optimal approximate posterior need not tend to the prior if
the activation function is not odd, showing that our statements cannot be
generalized arbitrarily.

Vidhi Lalchand, Wessel P. Bruinsma, David R. Burt, and Carl E. Rasmussen.
**Sparse Gaussian
process hyperparameters: Optimize or integrate?**.
In *36th Conference on Neural Information Processing Systems*, 2022.

** Abstract:** The kernel function and
its hyperparameters are the central model selection choice in a Gaussian
process [Rasmussen and Williams, 2006]. Typically, the hyperparameters of the
kernel are chosen by maximising the marginal likelihood, an approach known as
Type-II maximum likelihood (ML-II). However, ML-II does not account for
hyperparameter uncertainty, and it is well-known that this can lead to
severely biased estimates and an underestimation of predictive uncertainty.
While there are several works which employ a fully Bayesian characterisation
of GPs, relatively few propose such approaches for the sparse GPs paradigm.
In this work we propose an algorithm for sparse Gaussian process regression
which leverages MCMC to sample from the hyperparameter posterior within the
variational inducing point framework of [Titsias, 2009]. This work is closely
related to Hensman et al. [2015b], but side-steps the need to sample the
inducing points, thereby significantly improving sampling efficiency in the
Gaussian likelihood case. We compare this scheme against natural baselines in
the literature and against stochastic variational GPs (SVGPs), and provide an
extensive computational analysis.

Alexander Terenin, David R. Burt, Artem Artemev, Seth Flaxman, Mark van der
Wilk, Carl Edward Rasmussen, and Hong Ge.
**Numerically stable sparse
Gaussian processes via minimum separation using cover trees**.
*arXiv*, 2022.

** Abstract:** As Gaussian processes
mature, they are increasingly being deployed as part of larger machine
learning and decision-making systems, for instance in geospatial modeling,
Bayesian optimization, or in latent Gaussian models. Within a system, the
Gaussian process model needs to perform in a stable and reliable manner to
ensure it interacts correctly with other parts of the system. In this work, we
study the numerical stability of scalable sparse approximations based on
inducing points. We derive sufficient and in certain cases necessary
conditions on the inducing points for the computations performed to be
numerically stable. For low-dimensional tasks such as geospatial modeling, we
propose an automated method for computing inducing points satisfying these
conditions. This is done via a modification of the cover tree data structure,
which is of independent interest. We additionally propose an alternative
sparse approximation for regression with a Gaussian likelihood which trades
off a small amount of performance to further improve stability. We evaluate
the proposed techniques on a number of examples, showing that, in geospatial
settings, sparse approximations with guaranteed numerical stability often
perform comparably to those without.

Artem Artemev, David R. Burt, and Mark van der Wilk.
**Tighter
bounds on the log marginal likelihood of Gaussian process regression using
conjugate gradients**.
In *38th International Conference on Machine Learning*, 2021.

** Abstract:** We propose a lower bound on the log marginal
likelihood of Gaussian process regression models that can be computed without
matrix factorisation of the full kernel matrix. We show that approximate
maximum likelihood learning of model parameters by maximising our lower bound
retains many benefits of the sparse variational approach while reducing the
bias introduced into hyperparameter learning. The basis of our bound is a
more careful analysis of the log-determinant term appearing in the log
marginal likelihood, as well as using the method of conjugate gradients to
derive tight lower bounds on the term involving a quadratic form. Our
approach is a step forward in unifying methods relying on lower bound
maximisation (e.g. variational methods) and iterative approaches based on
conjugate gradients for training Gaussian processes. In experiments, we show
improved predictive performance with our model for a comparable amount of
training time compared to other conjugate gradient based approaches.

David R. Burt, Sebastian W. Ober, Adrià Garriga-Alonso, and Mark van der
Wilk.
**Understanding
variational inference in function-space**.
In *3rd Symposium on Advances in Approximate Bayesian Inference*,
2021.

** Abstract:** Recent work has attempted to directly
approximate the ‘function-space’ or predictive posterior distribution of
Bayesian models, without approximating the posterior distribution over the
parameters. This is appealing in e.g. Bayesian neural networks, where we only
need the former, and the latter is hard to represent. In this work, we
highlight some advantages and limitations of employing the Kullback-Leibler
divergence in this setting. For example, we show that minimizing the KL
divergence between a wide class of parametric distributions and the posterior
induced by a (non-degenerate) Gaussian process prior leads to an ill-defined
objective function. Then, we propose (featurized) Bayesian linear regression
as a benchmark for ‘function-space’ inference methods that directly
measures approximation quality. We apply this methodology to assess aspects
of the objective function and inference scheme considered in Sun et al.
(2018), emphasizing the quality of approximation to Bayesian inference as
opposed to predictive performance.

Andrew Y. K. Foong, Wessel P. Bruinsma, David R. Burt, and Richard E. Turner.
**How
tight can PAC-Bayes be in the small data regime?**.
In *Advances in Neural Information Processing Systems 34*. Curran
Associates, Inc., 2021.

** Abstract:** In this paper, we
investigate the question: Given a small number of datapoints, for example N =
30, how tight can PAC-Bayes and test set bounds be made? For such small
datasets, test set bounds adversely affect generalisation performance by
withholding data from the training procedure. In this setting, PAC-Bayes
bounds are especially attractive, due to their ability to use all the data to
simultaneously learn a posterior and bound its generalisation risk. We focus
on the case of i.i.d. data with a bounded loss and consider the generic
PAC-Bayes theorem of Germain et al. While their theorem is known to recover
many existing PAC-Bayes bounds, it is unclear what the tightest bound
derivable from their framework is. For a fixed learning algorithm and
dataset, we show that the tightest possible bound coincides with a bound
considered by Catoni; and, in the more natural case of distributions over
datasets, we establish a lower bound on the best bound achievable in
expectation. Interestingly, this lower bound recovers the Chernoff test set
bound if the posterior is equal to the prior. Moreover, to illustrate how
tight these bounds can be, we study synthetic one-dimensional classification
tasks in which it is feasible to meta-learn both the prior and the form of
the bound to numerically optimise for the tightest bounds possible. We find
that in this simple, controlled scenario, PAC-Bayes bounds are competitive
with comparable, commonly used Chernoff test set bounds. However, the
sharpest test set bounds still lead to better guarantees on the
generalisation error than the PAC-Bayes bounds we consider.
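
For a sense of scale at N = 30, the following sketch computes a Chernoff-style test set bound by inverting the binary KL divergence with bisection. It assumes the standard form "with probability at least 1 − δ, kl(empirical risk || true risk) ≤ log(1/δ)/N" for a 0-1 loss on held-out data; whether this matches the exact baseline used in the paper is an assumption, and the function names are illustrative.

```python
import math

def binary_kl(q, p):
    """KL divergence between Bernoulli(q) and Bernoulli(p), with 0 < p < 1."""
    total = 0.0
    if q > 0.0:
        total += q * math.log(q / p)
    if q < 1.0:
        total += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return total

def chernoff_test_set_bound(empirical_risk, n, delta, tol=1e-12):
    """Largest p >= empirical_risk with kl(empirical_risk || p) <= log(1/delta)/n."""
    budget = math.log(1.0 / delta) / n
    lo, hi = empirical_risk, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        # kl(q || p) increases in p for p >= q, so bisection finds the crossing.
        if mid < 1.0 and binary_kl(empirical_risk, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

n, delta = 30, 0.05
for errors in (0, 3, 6):
    bound = chernoff_test_set_bound(errors / n, n, delta)
    print(f"{errors}/{n} test errors -> risk <= {bound:.3f} with prob >= {1 - delta}")
```

With zero test errors the inversion has the closed form 1 − δ^(1/N), which gives a quick check on the bisection.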

David R. Burt, Carl Edward Rasmussen, and Mark van der Wilk.
**Convergence
of sparse variational inference in Gaussian processes regression**.
*Journal of Machine Learning Research*, 21, 2020.

** Abstract:** Gaussian processes are distributions over functions that are
versatile and mathematically convenient priors in Bayesian modelling.
However, their use is often impeded for data with large numbers of
observations, N, due to the cubic (in N) cost of matrix operations used in
exact inference. Many solutions have been proposed that rely on $M \ll N$
inducing variables to form an approximation at a cost of $O(NM^2)$.
While the computational cost appears linear in N, the true complexity depends
on how M must scale with N to ensure a certain quality of the approximation.
In this work, we investigate upper and lower bounds on how M needs to grow
with N to ensure high quality approximations. We show that we can make the
KL-divergence between the approximate model and the exact posterior
arbitrarily small for a Gaussian-noise regression model with $M \ll N$.
Specifically, for the popular squared exponential kernel and D-dimensional
Gaussian distributed covariates, $M = O((\log N)^D)$ suffices and a
method with an overall computational cost of
$O(N (\log N)^{2D} (\log \log N)^2)$ can be used to perform inference.

Andrew Foong, David Burt, Yingzhen Li, and Richard Turner.
**On the expressiveness of approximate inference in Bayesian neural
networks**.
In *Advances in Neural Information Processing Systems 34*, 2020.

David Janz, David Burt, and Javier Gonzalez.
**Bandit
optimisation of functions in the Matérn kernel RKHS**.
In *23rd International Conference on Artificial Intelligence and
Statistics*, 2020.

** Abstract:** We consider the problem
of optimising functions in the reproducing kernel Hilbert space (RKHS) of a
Matérn kernel with smoothness parameter $\nu$ over the domain $[0,1]^d$ under
noisy bandit feedback. Our contribution, the $π$-GP-UCB algorithm, is the
first practical approach with guaranteed sublinear regret for all $\nu>1$
and $d \geq 1$. Empirical validation suggests better performance and
drastically improved computational scalability compared with its predecessor,
Improved GP-UCB.

David R Burt, Carl Edward Rasmussen, and Mark van der Wilk.
**Rates of convergence for sparse
variational Gaussian process regression**.
*arXiv*, 2019.

** Abstract:** Excellent variational
approximations to Gaussian process posteriors have been developed which avoid
the $O(N^3)$ scaling with dataset size N. They reduce the
computational cost to $O(NM^2)$, with $M \ll N$ being the number of
inducing variables, which summarise the process. While the computational cost
seems to be linear in N, the true complexity of the algorithm depends on how
M must increase to ensure a certain quality of approximation. We address this
by characterising the behavior of an upper bound on the KL divergence to the
posterior. We show that with high probability the KL divergence can be made
arbitrarily small by growing M more slowly than N. A particular case of
interest is that for regression with normally distributed inputs in
D dimensions with the popular Squared Exponential kernel,
$M = O(\log^D N)$ is sufficient. Our results show that as datasets grow,
Gaussian process posteriors can truly be approximated cheaply, and provide a
concrete rule for how to increase M in continual learning scenarios.

Jan-Peter Calliess, Stephen J. Roberts, Carl Edward Rasmussen, and Jan
Maciejowski.
**Lazily adapted constant kinky inference for non-parametric regression and
model-reference adaptive control**.
*Automatica*, 122, 2020. doi:
10.1016/j.automatica.2020.109216.

** Abstract:**
Techniques known as Nonlinear Set Membership prediction or Lipschitz
Interpolation are approaches to supervised machine learning that utilise
presupposed Lipschitz properties to perform inference over unobserved
function values. Provided a bound on the true best Lipschitz constant of the
target function is known a priori, they offer convergence guarantees, as well
as bounds around the predictions. Considering a more general setting that
builds on Lipschitz continuity, we propose a method for estimating
the Lipschitz constant online from function value observations that are
possibly corrupted by bounded noise. Utilising this as a data-dependent
hyper-parameter gives rise to a nonparametric machine learning method, for
which we establish strong universal approximation guarantees. That is, we
show that our prediction rule can learn any continuous function on compact
support in the limit of increasingly dense data, up to a worst-case error
that can be bounded by the level of observational error. We also consider
applications of our nonparametric regression method to learning-based
control. For a class of discrete-time settings, we establish convergence
guarantees on the closed-loop tracking error of our online learning-based
controllers. To provide evidence that our method can be beneficial not only
in theory but also in practice, we apply it in the context of nonparametric
model-reference adaptive control (MRAC). Across a range of simulated aircraft
roll-dynamics and performance metrics our approach outperforms recently
proposed alternatives that were based on Gaussian processes and RBF-neural
networks.
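
The basic prediction rule behind Lipschitz interpolation can be sketched in a few lines for the one-dimensional case with a known Lipschitz constant. This is a minimal illustration under stated assumptions (absolute-value metric, a known constant and noise bound), not the adaptive method of the paper; `lipschitz_predict` and its parameters are hypothetical names.

```python
import math

def lipschitz_predict(x, xs, ys, lipschitz, noise=0.0):
    """Return (prediction, lower, upper) at query x from 1-D samples.

    Any function with the given Lipschitz constant that lies within `noise`
    of every sample must pass through [lower, upper] at x; the prediction
    is the midpoint of that interval.
    """
    upper = min(y + noise + lipschitz * abs(x - xi) for xi, y in zip(xs, ys))
    lower = max(y - noise - lipschitz * abs(x - xi) for xi, y in zip(xs, ys))
    return 0.5 * (upper + lower), lower, upper

# Noise-free samples of sin, whose Lipschitz constant is 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [math.sin(x) for x in xs]

prediction, lower, upper = lipschitz_predict(1.5, xs, ys, lipschitz=1.0)
print(f"prediction={prediction:.3f}, bounds=[{lower:.3f}, {upper:.3f}], "
      f"truth={math.sin(1.5):.3f}")
```

The interval width shrinks as samples become denser, which is the intuition behind the convergence guarantees mentioned above.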

Jan-Peter Calliess, Stephen Roberts, Carl Edward Rasmussen, and Jan
Maciejowski.
**Nonlinear set
membership regression with adaptive hyper-parameter estimation for online
learning and control**.
In *Proceedings of the European Control Conference*, 2018.

** Abstract:** Methods known as Lipschitz Interpolation or
Nonlinear Set Membership regression have become established tools for
nonparametric system-identification and data-based control. They utilise
presupposed Lipschitz properties to compute inferences over unobserved
function values. Unfortunately, they rely on the a priori knowledge of a
Lipschitz constant of the underlying target function which serves as a
hyperparameter. We propose a closed-form estimator of the Lipschitz constant
that is robust to bounded observational noise in the data. The merger of
Lipschitz Interpolation with the new hyperparameter estimator gives a new
nonparametric machine learning method for which we derive online learning
convergence guarantees. Furthermore, we apply our learning method to
model-reference adaptive control and provide a convergence guarantee on the
closed-loop dynamics. In a simulated flight manoeuvre control scenario, we
compare the performance of our approach to recently proposed alternative
learning-based controllers.
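
One plausible closed-form estimator of this kind takes the largest pairwise slope after discounting the worst-case noise contribution. The sketch below assumes one-dimensional inputs and a known bound on the observation noise; the paper's exact estimator may differ, and `estimate_lipschitz` is an illustrative name. If each observation is within `noise_bound` of the true function value, the estimate cannot exceed the true Lipschitz constant.

```python
import itertools
import random

def estimate_lipschitz(xs, ys, noise_bound):
    """Largest noise-corrected slope between any pair of 1-D samples."""
    best = 0.0
    for (xi, yi), (xj, yj) in itertools.combinations(zip(xs, ys), 2):
        gap = abs(xi - xj)
        if gap > 0.0:
            # Two noisy observations can differ by up to 2 * noise_bound
            # even when the underlying function values coincide.
            best = max(best, (abs(yi - yj) - 2.0 * noise_bound) / gap)
    return best

rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(50)]
# Target 3|x| has Lipschitz constant 3; noise is bounded by 0.1.
ys = [3.0 * abs(x) + rng.uniform(-0.1, 0.1) for x in xs]

print("naive estimate:    ", estimate_lipschitz(xs, ys, noise_bound=0.0))
print("corrected estimate:", estimate_lipschitz(xs, ys, noise_bound=0.1))
```

The naive estimate (no correction) can be inflated arbitrarily by noise on nearby inputs, whereas the corrected one stays at or below the true constant.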

Daniel Limon, Jan-Peter Calliess, and Jan Maciejowski.
**Learning-based nonlinear model predictive control**.
In *IFAC 2017 World Congress*, Toulouse, France, July 2017. doi:
10.1016/j.ifacol.2017.08.1050.

** Abstract:** This paper
presents stabilizing Model Predictive Controllers (MPC) in which prediction
models are inferred from experimental data of the inputs and outputs of the
plant. Using a nonparametric machine learning technique called LACKI, the
estimated (possibly nonlinear) model function together with an estimate of
the Hölder constant is provided. Based on these, a number of predictive
controllers with stability guaranteed by design are proposed. Firstly, the
case when the prediction model is estimated offline is considered and
robust stability and recursive feasibility are ensured by using tightened
constraints in the optimisation problem. This controller has been extended to
the more interesting and complex case: the online learning of the model,
where the new data collected from feedback is added to enhance the prediction
model. An online learning MPC based on a double sequence of predictions is
proposed. Stability of the online learning MPC is proved. These controllers
are illustrated by simulation.

Jan-Peter Calliess.
**Lipschitz
optimisation for Lipschitz interpolation**.
In *2017 American Control Conference (ACC 2017)*, Seattle, WA, USA, May
2017.

** Abstract:** Techniques known as Nonlinear Set
Membership prediction, Kinky Inference or Lipschitz Interpolation are fast
and numerically robust approaches to nonparametric machine learning that have
been proposed to be utilised in the context of system identification and
learning-based control. They utilise presupposed Lipschitz properties in
order to compute inferences over unobserved function values. Unfortunately,
most of these approaches rely on exact knowledge about the input space metric
as well as about the Lipschitz constant. Furthermore, existing techniques to
estimate the Lipschitz constants from the data are not robust to noise or
seem to be ad-hoc and typically are decoupled from the ultimate learning and
prediction task. To overcome these limitations, we propose an approach for
optimising parameters of the presupposed metrics by minimising validation set
prediction errors. To avoid poor performance due to local minima, we propose
to utilise Lipschitz properties of the optimisation objective to ensure
global optimisation success. The resulting approach is a new flexible method
for nonparametric black-box learning. We illustrate its competitiveness on a
set of benchmark problems.

Jan-Peter Calliess, Nathan Korda, and Geoffrey J. Gordon.
**A distributed
mechanism for multi-agent convex optimisation and coordination with no-regret
learners**.
In *Workshop on Learning, Inference and Control of Multi-Agent Systems,
NIPS*, Barcelona, Spain, December 2016.

** Abstract:** We
develop an indirect mechanism for coordinated, distributed multi-agent
optimisation, and decision-making. Our approach extends previous work in
no-regret learning based mechanism design and renders it applicable to
partial information settings. We consider planning problems that can be
stated as a collection of single-agent convex programmes coupled by common
soft constraints. A key idea is to recast the joint optimisation problem as
distributed learning in a repeated game between the original agents and a
newly introduced group of adversarial agents who influence prices for
decisions and facilitate coordination. Under the weak behavioural assumption
that all agents employ selfish, sub-linear regret algorithms in the course of
the repeated game, we guarantee that our mechanism can achieve design goals
such as social optimality (efficiency) and Nash-equilibrium convergence to
within an error which approaches zero as the agents gain experience. Our
error bounds are deterministic or probabilistic, depending on the nature of
the regret bounds available for the algorithms employed by the agents. We
illustrate our method in an emissions market application.

Jan-Peter Calliess.
**Lazily adapted constant kinky
inference for nonparametric regression and model-reference adaptive
control**.
*arXiv*, arXiv:1701.00178, 2016.

** Abstract:**
Techniques known as Nonlinear Set Membership prediction, Lipschitz
Interpolation or Kinky Inference are approaches to machine learning that
utilise presupposed Lipschitz properties to compute inferences over
unobserved function values. Provided a bound on the true best Lipschitz
constant of the target function is known a priori they offer convergence
guarantees as well as bounds around the predictions. Considering a more
general setting that builds on Hölder continuity relative to
pseudo-metrics, we propose a method for estimating the Hölder
constant online from function value observations that are possibly corrupted
by bounded observational errors. Utilising this to compute adaptive
parameters within a kinky inference rule gives rise to a nonparametric
machine learning method, for which we establish strong universal
approximation guarantees. That is, we show that our prediction rule can learn
any continuous function in the limit of increasingly dense data to within a
worst-case error bound that depends on the level of observational
uncertainty. We apply our method in the context of nonparametric
model-reference adaptive control (MRAC). Across a range of simulated aircraft
roll-dynamics and performance metrics our approach outperforms recently
proposed alternatives that were based on Gaussian processes and RBF-neural
networks. For discrete-time systems, we provide stability guarantees for our
learning-based controllers both for the batch and the online learning
setting.

Jan-Peter Calliess.
**Bayesian
Lipschitz constant estimation and quadrature**.
In *Workshop on Probabilistic Integration, NIPS*, Montreal, Canada,
December 2015.

** Abstract:** Lipschitz quadrature methods
provide an approach to one-dimensional numerical integration on bounded
domains. On the basis of the assumption that the integrand is Lipschitz
continuous with a known Lipschitz constant, these quadrature rules can
provide a tight error bound around their integral estimates and utilise the
Lipschitz constant to guide exploration in the context of adaptive
quadrature. In this paper, we outline our ongoing work on extending this
approach to settings where the Lipschitz constant is probabilistically
uncertain. As the key component, we introduce a Bayesian approach for
updating a subjectively probabilistic belief of the Lipschitz constant.
Combined with any Lipschitz quadrature rule, we obtain an approach for
translating a sample into an integral estimate with probabilistic uncertainty
intervals. The paper concludes with an illustration of the approach followed
by a discussion of open issues and future work.
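
The known-constant baseline that this work extends can be sketched as follows. With sorted, noise-free samples and a known Lipschitz constant, the integrand is trapped between two piecewise-linear envelopes, and integrating them in closed form gives an estimate with a worst-case error bound. This is an illustration only: it does not implement the Bayesian treatment of an uncertain constant, and `lipschitz_quadrature` is a hypothetical name.

```python
import math

def lipschitz_quadrature(xs, ys, lipschitz):
    """Integral estimate and worst-case error over [xs[0], xs[-1]].

    Assumes xs is sorted and the samples are exact values of a function
    with the given Lipschitz constant.
    """
    estimate = 0.0
    error = 0.0
    for i in range(len(xs) - 1):
        h = xs[i + 1] - xs[i]
        dy = ys[i + 1] - ys[i]
        # Midpoint of the integrals of the upper and lower envelopes;
        # this coincides with the trapezoid rule.
        estimate += 0.5 * h * (ys[i] + ys[i + 1])
        # Half the area between the two envelopes on this sub-interval.
        error += (lipschitz ** 2 * h ** 2 - dy ** 2) / (4.0 * lipschitz)
    return estimate, error

# Integrate sin over [0, pi]; the exact value is 2 and the constant is 1.
xs = [math.pi * i / 10 for i in range(11)]
ys = [math.sin(x) for x in xs]
estimate, error = lipschitz_quadrature(xs, ys, lipschitz=1.0)
print(f"estimate={estimate:.4f} +/- {error:.4f}")
```

The per-interval error term is largest where the observed slope is far below the Lipschitz constant, which is where an adaptive rule would place its next sample.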

Talay M Cheema.
**Contrasting
discrete and continuous methods for Bayesian system identification**.
In *Workshop on Continuous Time Machine Learning at the 39th International
Conference on Machine Learning*, 2022.

** Abstract:** In
recent years, there has been considerable interest in embedding continuous
time methods in machine learning algorithms. In system identification, the
task is to learn a dynamical model from incomplete observation data, and when
prior knowledge is in continuous time – for example, mechanistic
differential equation models – it seems natural to use continuous time
models for learning. Yet when learning flexible, nonlinear, probabilistic
dynamics models, most previous work has focused on discrete time models to
avoid computational, numerical, and mathematical difficulties. In this work
we show, with the aid of small-scale examples, that this mismatch between
model and data generating process can be consequential under certain
circumstances, and we discuss possible modifications to discrete time models
which may better suit them to handling data generated by continuous time
processes.

Vidhi Lalchand, Kenza Tazi, Talay M Cheema, Richard E Turner, and Scott
Hosking.
**Kernel learning for explainable
climate science**.
In *16th Bayesian Modelling Applications Workshop at UAI*, 2022.

** Abstract:** The Upper Indus Basin (UIB), Himalayas, provides water
for 270 million people and countless ecosystems. However, precipitation, a
key component to hydrological modelling, is poorly understood in this area. A
key challenge surrounding this uncertainty comes from the complex
spatial-temporal distribution of precipitation across the basin. In this work
we propose Gaussian processes with structured non-stationary kernels to model
precipitation patterns in the UIB. Previous attempts to quantify or model
precipitation in the Hindu Kush Karakoram Himalayan region have often been
qualitative or include crude assumptions and simplifications which cannot be
resolved at lower resolutions. This body of research also provides little to
no error propagation. We account for the spatial variation in precipitation
with a non-stationary Gibbs kernel parameterised with an input dependent
lengthscale. This allows the posterior function samples to adapt to the
varying precipitation patterns inherent in the distinct underlying topography
of the Indus region. The input dependent lengthscale is governed by a latent
Gaussian process with a stationary squared-exponential kernel to allow the
function level hyperparameters to vary smoothly. In ablation experiments we
motivate each component of the proposed kernel by demonstrating its ability
to model the spatial covariance, temporal structure and joint spatio-temporal
reconstruction. We benchmark our model against a stationary Gaussian process and
a deep Gaussian process.
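
As a rough illustration of the non-stationary Gibbs kernel mentioned in the abstract above, the sketch below implements the standard one-dimensional Gibbs kernel with an input-dependent lengthscale. It is not the authors' code: the function names and the `softplus_lengthscale` function are hypothetical, and a fixed deterministic lengthscale function stands in for the latent Gaussian process that governs the lengthscale in the paper.

```python
import numpy as np


def gibbs_kernel(x1, x2, lengthscale_fn):
    """Standard 1-D Gibbs kernel with input-dependent lengthscale l(x).

    k(x, x') = sqrt(2 l(x) l(x') / (l(x)^2 + l(x')^2))
               * exp(-(x - x')^2 / (l(x)^2 + l(x')^2))
    """
    x1 = np.asarray(x1, dtype=float)[:, None]  # shape (n, 1)
    x2 = np.asarray(x2, dtype=float)[None, :]  # shape (1, m)
    l1 = lengthscale_fn(x1)
    l2 = lengthscale_fn(x2)
    sq_sum = l1**2 + l2**2
    prefactor = np.sqrt(2.0 * l1 * l2 / sq_sum)
    return prefactor * np.exp(-((x1 - x2) ** 2) / sq_sum)


def softplus_lengthscale(x):
    # Hypothetical smooth, strictly positive lengthscale function (assumption):
    # short lengthscales for negative inputs, longer ones for positive inputs.
    return 0.2 + np.log1p(np.exp(x))


def constant_lengthscale(x, value=0.7):
    # With a constant lengthscale the Gibbs kernel reduces to the
    # stationary squared-exponential kernel.
    return np.full_like(x, value)


x = np.linspace(-3.0, 3.0, 50)
K_gibbs = gibbs_kernel(x, x, softplus_lengthscale)
K_const = gibbs_kernel(x, x, constant_lengthscale)
K_se = np.exp(-((x[:, None] - x[None, :]) ** 2) / (2.0 * 0.7**2))
```

With a constant lengthscale the prefactor equals one and the exponent becomes the usual squared-exponential form, which is one way to sanity-check an implementation before letting the lengthscale vary.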

Talay M Cheema.
**Understanding
local linearisation in variational Gaussian process state space
models**.
In *Time Series Workshop at the 38th International Conference on Machine
Learning*, 2021.

** Abstract:** We describe variational
inference approaches in Gaussian process state space models in terms of local
linearisations of the approximate posterior function. Most previous
approaches have either assumed independence between the posterior dynamics
and latent states (the mean-field (MF) approximation), or optimised free
parameters for both, leading to limited scalability. We use our framework to
prove that (i) there is a theoretical imperative to use non-MF approaches, to
avoid excessive bias in the process noise hyperparameter estimate, and (ii)
we can parameterise only the posterior dynamics without any loss of
performance. Our approach suggests further approximations, based on the
existing rich literature on filtering and smoothing for nonlinear systems,
and unifies approaches for discrete and continuous time models.

Wenlin Chen, Samuel Horváth, and Peter Richtárik.
**Optimal client sampling
for federated learning**.
*Transactions on Machine Learning Research*, August 2022.

** Abstract:** It is well understood that client-master
communication can be a primary bottleneck in federated learning (FL). In this
work, we address this issue with a novel client subsampling scheme, where we
restrict the number of clients allowed to communicate their updates back to
the master node. In each communication round, all participating clients
compute their updates, but only the ones with important updates communicate
back to the master. We show that importance can be measured using only the
norm of the update and give a formula for optimal client participation. This
formula minimizes the distance between the full update, where all clients
participate, and our limited update, where the number of participating
clients is restricted. In addition, we provide a simple algorithm that
approximates the optimal formula for client participation, which allows for
secure aggregation and stateless clients, and thus does not compromise client
privacy. We show both theoretically and empirically that for Distributed SGD
(DSGD) and Federated Averaging (FedAvg), the performance of our approach can
be close to full participation and superior to the baseline where
participating clients are sampled uniformly. Moreover, our approach is
orthogonal to and compatible with existing methods for reducing communication
overhead, such as local methods and communication compression methods.

** Comment:** arXiv

Wenlin Chen, Austin Tripp, and José Miguel Hernández-Lobato.
**Meta-learning adaptive deep
kernel Gaussian processes for molecular property prediction**.
*arXiv*, 2022.

** Abstract:** We propose Adaptive Deep
Kernel Fitting with Implicit Function Theorem (ADKF-IFT), a novel framework
for learning deep kernel Gaussian processes (GPs) by interpolating between
meta-learning and conventional deep kernel learning. Our approach employs a
bilevel optimization objective where we meta-learn generally useful feature
representations across tasks, in the sense that task-specific GP models
estimated on top of such features achieve the lowest possible predictive loss
on average. We solve the resulting nested optimization problem using the
implicit function theorem (IFT). We show that our ADKF-IFT framework contains
previously proposed Deep Kernel Learning (DKL) and Deep Kernel Transfer (DKT)
as special cases. Although ADKF-IFT is a completely general method, we argue
that it is especially well-suited for drug discovery problems and demonstrate
that it significantly outperforms previous state-of-the-art methods on a
variety of real-world few-shot molecular property prediction tasks and
out-of-domain molecular property prediction and optimization tasks.

Austin Tripp, Wenlin Chen, and José Miguel Hernández-Lobato.
**An evaluation framework
for the objective functions of de novo drug design benchmarks**.
In *ICLR 2022 Workshop on Machine Learning for Drug Discovery*, 2022.

** Abstract:** De novo drug design has recently received
increasing attention from the machine learning community. It is important
that the field is aware of the actual goals and challenges of drug design and
the roles that de novo molecule design algorithms could play in accelerating
the process, so that algorithms can be evaluated in a way that reflects how
they would be applied in real drug design scenarios. In this paper, we
propose a framework for critically assessing the merits of benchmarks, and
argue that most of the existing de novo drug design benchmark functions are
either highly unrealistic or depend upon a surrogate model whose performance
is not well characterized. In order for the field to achieve its long-term
goals, we recommend that poor benchmarks (especially logP and QED) be
deprecated in favour of better benchmarks. We hope that our proposed
framework can play a part in developing new de novo drug design benchmarks
that are more realistic and ideally incorporate the intrinsic goals of drug
design.

Andrew Webb, Charles Reynolds, Wenlin Chen, Henry Reeve, Dan Iliescu, Mikel
Luján, and Gavin Brown.
**To
ensemble or not ensemble: When does end-to-end training fail?**.
In *European Conference on Machine Learning (ECML)*, 2020.

** Abstract:** End-to-End training (E2E) is becoming more and
more popular to train complex Deep Network architectures. An interesting
question is whether this trend will continue: are there any clear failure
cases for E2E training? We study this question in depth, for the specific
case of E2E training an ensemble of networks. Our strategy is to blend the
gradient smoothly in between two extremes: from independent training of the
networks, up to full E2E training. We find clear failure cases, where
overparameterized models cannot be trained E2E. A surprising result is that
the optimum can sometimes lie in between the two, neither an ensemble nor an
E2E system. The work also uncovers links to Dropout, and raises questions
around the nature of ensemble diversity and multi-branch networks.

** Comment:** arXiv

Yarin Gal, Yutian Chen, and Zoubin Ghahramani.
**Latent
Gaussian processes for distribution estimation of multivariate categorical
data**.
In *Proceedings of the 32nd International Conference on Machine Learning
(ICML-15)*, pages 645-654, 2015.

** Abstract:**
Multivariate categorical data occur in many applications of machine learning.
One of the main difficulties with these vectors of categorical variables is
sparsity. The number of possible observations grows exponentially with vector
length, but dataset diversity might be poor in comparison. Recent models have
gained significant improvement in supervised tasks with this data. These
models embed observations in a continuous space to capture similarities
between them. Building on these ideas we propose a Bayesian model for the
unsupervised task of distribution estimation of multivariate categorical
data. We model vectors of categorical variables as generated from a
non-linear transformation of a continuous latent space. Non-linearity
captures multi-modality in the distribution. The continuous representation
addresses sparsity. Our model ties together many existing models, linking the
linear categorical latent Gaussian model, the Gaussian process latent
variable model, and Gaussian process classification. We derive inference for
our model based on recent developments in sampling based variational
inference. We show empirically that the model outperforms its linear and
discrete counterparts in imputation tasks of sparse data.

Hong Ge, Yutian Chen, Moquan Wan, and Zoubin Ghahramani.
**Distributed
Inference for Dirichlet Process Mixture Models**.
37:2276-2284, 07-09 Jul 2015.

** Abstract:** Bayesian
nonparametric mixture models based on the Dirichlet process (DP) have been
widely used for solving problems like clustering, density estimation and
topic modelling. These models make weak assumptions about the underlying
process that generated the observed data. Thus, when more data are collected,
the complexity of these models can change accordingly. These theoretical
properties often lead to superior predictive performance when compared to
traditional finite mixture models. However, despite the increasing amount of
data available, the application of Bayesian nonparametric mixture models is
so far limited to relatively small data sets. In this paper, we propose an
efficient distributed inference algorithm for the DP and the HDP mixture
model. The proposed method is based on a variant of the slice sampler for
DPs. Since this sampler does not involve a pre-determined truncation, the
stationary distribution of the sampling algorithm is unbiased. We provide
both local thread-level and distributed machine-level parallel
implementations and study the performance of this sampler through an
extensive set of experiments on image and text data. When compared to
existing inference algorithms, the proposed method exhibits state-of-the-art
accuracy and strong scalability with up to 512 cores.

Anoop Korattikara, Yutian Chen, and Max Welling.
**Austerity
in MCMC land: Cutting the Metropolis-Hastings budget**.
In *31st International Conference on Machine Learning*, pages 181-189,
Beijing, China, June 2014.

** Abstract:** Can we make Bayesian
posterior MCMC sampling more efficient when faced with very large datasets?
We argue that computing the likelihood for N datapoints in the
Metropolis-Hastings (MH) test to reach a single binary decision is
computationally inefficient. We introduce an approximate MH rule based on a
sequential hypothesis test that allows us to accept or reject samples with
high confidence using only a fraction of the data required for the exact MH
rule. While this method introduces an asymptotic bias, we show that this bias
can be controlled and is more than offset by a decrease in variance due to
our ability to draw more samples per unit of time.

** Comment:** supplementary

Roger Frigola, Yutian Chen, and Carl Edward Rasmussen.
**Variational
Gaussian process state-space models**.
In Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and K.Q. Weinberger,
editors, *Advances in Neural Information Processing Systems 27*,
2014.

** Abstract:** State-space models have been successfully
used for more than fifty years in different areas of science and engineering.
We present a procedure for efficient variational Bayesian learning of
nonlinear state-space models based on sparse Gaussian processes. The result
of learning is a tractable posterior over nonlinear dynamical systems. In
comparison to conventional parametric models, we offer the possibility to
straightforwardly trade off model capacity and computational cost whilst
avoiding overfitting. Our main algorithm uses a hybrid inference approach
combining variational Bayes and sequential Monte Carlo. We also present
stochastic variational inference and online learning approaches for fast
learning with long time series.

Ross M. Clarke, Elre T. Oldewage, and José Miguel Hernández-Lobato.
**Scalable one-pass
optimisation of high-dimensional weight-update hyperparameters by implicit
differentiation**.
In *10th International Conference on Learning Representations*, Virtual,
April 2022.

** Abstract:** Machine learning training methods
depend plentifully and intricately on hyperparameters, motivating automated
strategies for their optimisation. Many existing algorithms restart training
for each new hyperparameter choice, at considerable computational cost. Some
hypergradient-based one-pass methods exist, but these either cannot be
applied to arbitrary optimiser hyperparameters (such as learning rates and
momenta) or take several times longer to train than their base models. We
extend these existing methods to develop an approximate hypergradient-based
hyperparameter optimiser which is applicable to any continuous hyperparameter
appearing in a differentiable model weight update, yet requires only one
training episode, with no restarts. We also provide a motivating argument for
convergence to the true hypergradient, and perform tractable gradient-based
optimisation of independent learning rates for each model parameter. Our
method performs competitively from varied random hyperparameter
initialisations on several UCI datasets and Fashion-MNIST (using a one-layer
MLP), Penn Treebank (using an LSTM) and CIFAR-10 (using a ResNet-18), in time
only 2-3x greater than vanilla training.

Katherine M. Collins, Umang Bhatt, and Adrian Weller.
**Eliciting and learning with soft
labels from every annotator**.
In *Proceedings of the AAAI Conference on Human Computation and
Crowdsourcing (HCOMP)*, 2022, doi 10.17863/CAM.87954.

** Abstract:** The labels used to train machine learning (ML)
models are of paramount importance. Typically for ML classification tasks,
datasets contain hard labels, yet learning using soft labels has been shown
to yield benefits for model generalization, robustness, and calibration.
Earlier work found success in forming soft labels from multiple annotators'
hard labels; however, this approach may not converge to the best labels and
necessitates many annotators, which can be expensive and inefficient. We
focus on efficiently eliciting soft labels from individual annotators. We
collect and release a dataset of soft labels (which we call CIFAR-10S) over
the CIFAR-10 test set via a crowdsourcing study (N=248). We demonstrate that
learning with our labels achieves comparable model performance to prior
approaches while requiring far fewer annotators - albeit with significant
temporal costs per elicitation. Our elicitation methodology therefore shows
nuanced promise in enabling practitioners to enjoy the benefits of improved
model performance and reliability with fewer annotators, and serves as a
guide for future dataset curators on the benefits of leveraging richer
information, such as categorical uncertainty, from individual annotators.

** Comment:** [Project Page] [Data] [Code]

E. Gilboa, Yunus Saatçi, and John P. Cunningham.
**Scaling multidimensional inference for structured Gaussian processes**.
*IEEE Transactions on Pattern Analysis and Machine Intelligence*, pages
424-436, 2015, doi
10.1109/TPAMI.2013.192.

** Abstract:** Exact Gaussian
process (GP) regression has O(N^{3}) runtime for data size N, making
it intractable for large N. Many algorithms for improving GP scaling
approximate the covariance with lower rank matrices. Other work has exploited
structure inherent in particular covariance functions, including GPs with
implied Markov structure, and inputs on a lattice (both enable O(N) or O(N
log N) runtime). However, these GP advances have not been well extended to
the multidimensional input setting, despite the preponderance of
multidimensional applications. This paper introduces and tests three novel
extensions of structured GPs to multidimensional inputs, for models with
additive and multiplicative kernels. First we present a new method for
inference in additive GPs, showing a novel connection between the classic
backfitting method and the Bayesian framework. We extend this model using two
advances: a variant of projection pursuit regression, and a Laplace
approximation for non-Gaussian observations. Lastly, for multiplicative
kernel structure, we present a novel method for GPs with inputs on a
multidimensional grid. We illustrate the power of these three advances on
several data sets, achieving performance equal to or very close to the naive
GP at orders of magnitude less cost.

** Comment:** arXiv
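
To illustrate why inputs on a multidimensional grid with a multiplicative kernel can be so much cheaper than the naive O(N^{3}) GP described in the abstract above, the sketch below shows one standard ingredient of such methods: a matrix-vector product with a Kronecker-structured covariance that never forms the full matrix. This is an assumed, minimal illustration rather than the paper's algorithm; the helper names `rbf` and `kron_matvec` and the grid sizes are hypothetical.

```python
import numpy as np


def rbf(x, lengthscale):
    """Squared-exponential kernel matrix for 1-D inputs x."""
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)


def kron_matvec(K1, K2, v):
    """Compute (K1 kron K2) @ v without forming the Kronecker product.

    Uses the identity (K1 kron K2) vec(V) = vec(K1 V K2^T) for row-major vec,
    which costs O(n1 * n2 * (n1 + n2)) instead of O(n1^2 * n2^2).
    """
    n1, n2 = K1.shape[0], K2.shape[0]
    V = v.reshape(n1, n2)  # row-major: v[i * n2 + j] == V[i, j]
    return (K1 @ V @ K2.T).ravel()


# A small 6 x 5 grid; a product kernel on the grid has covariance K1 kron K2.
rng = np.random.default_rng(0)
x1 = np.linspace(0.0, 1.0, 6)
x2 = np.linspace(0.0, 2.0, 5)
K1, K2 = rbf(x1, 0.3), rbf(x2, 0.5)
v = rng.standard_normal(x1.size * x2.size)

fast = kron_matvec(K1, K2, v)      # structured product
slow = np.kron(K1, K2) @ v         # naive product, for comparison only
```

The same identity applies to the eigendecompositions of the per-dimension kernel matrices, which is how grid-structured solves avoid cubic cost in the total number of points.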

E. Gilboa, Yunus Saatçi, and John P. Cunningham.
**Scaling
multidimensional Gaussian processes using projected additive
approximations**.
In *30th International Conference on Machine Learning*, 2013.

** Abstract:** Exact Gaussian Process (GP) regression has
O(N^{3}) runtime for data size N, making it intractable for large N.
Many algorithms for improving GP scaling approximate the covariance with
lower rank matrices. Other work has exploited structure inherent in
particular covariance functions, including GPs with implied Markov structure,
and equispaced inputs (both enable O(N) runtime). However, these GP advances
have not been extended to the multidimensional input setting, despite the
preponderance of multidimensional applications. This paper introduces and
tests novel extensions of structured GPs to multidimensional inputs. We
present new methods for additive GPs, showing a novel connection between the
classic backfitting method and the Bayesian framework. To achieve optimal
accuracy-complexity tradeoff, we extend this model with a novel variant of
projection pursuit regression. Our primary result – projection pursuit
Gaussian Process Regression – shows orders of magnitude speedup while
preserving high accuracy. The natural second and third steps include
non-Gaussian observations and higher dimensional equispaced grid methods. We
introduce novel techniques to address both of these necessary directions. We
thoroughly illustrate the power of these three advances on several datasets,
achieving close performance to the naive Full GP at orders of magnitude less
cost.

Andrew Gordon Wilson, Elad Gilboa, Arye Nehorai, and John P Cunningham.
**GPatt: Fast multidimensional
pattern extrapolation with Gaussian processes**.
*arXiv preprint arXiv:1310.5288*, 2013.

** Abstract:**
Gaussian processes are typically used for smoothing and interpolation on
small datasets. We introduce a new Bayesian nonparametric framework - GPatt
- enabling automatic pattern extrapolation with Gaussian processes on large
multidimensional datasets. GPatt unifies and extends highly expressive
kernels and fast exact inference techniques. Without human intervention - no
hand crafting of kernel features, and no sophisticated initialisation
procedures - we show that GPatt can solve large scale pattern extrapolation,
inpainting, and kernel discovery problems, including a problem with 383,400
training points. We find that GPatt significantly outperforms popular
alternative scalable Gaussian process methods in speed and accuracy.
Moreover, we discover profound differences between each of these methods,
suggesting expressive kernels, nonparametric representations, and scalable
inference which exploits model structure are useful in combination for
modelling large scale multidimensional patterns.

John P. Cunningham, Zoubin Ghahramani, and Carl Edward Rasmussen.
**Gaussian
Processes for time-marked time-series data**.
In *15th International Conference on Artificial Intelligence and
Statistics*, pages 255-263, 2012.

** Abstract:** In many
settings, data is collected as multiple time series, where each recorded time
series is an observation of some underlying dynamical process of interest.
These observations are often time-marked with known event times, and one
desires to do a range of standard analyses. When there is only one time
marker, one simply aligns the observations temporally on that marker. When
multiple time-markers are present and are at different times on different
time series observations, these analyses are more difficult. We describe a
Gaussian Process model for analyzing multiple time series with multiple time
markings, and we test it on a variety of data.

J. H. Macke, L. Busing, J. P. Cunningham, B. M. Yu, K. V. Shenoy, and
M. Sahani.
**Empirical
models of spiking in neural populations**.
In *Advances in Neural Information Processing Systems 25*, pages 1-8,
Granada, Spain, December 2011.

** Abstract:** Neurons in the
neocortex code and compute as part of a locally interconnected population.
Large-scale multi-electrode recording makes it possible to access these
population processes empirically by fitting statistical models to unaveraged
data. What statistical structure best describes the concurrent spiking of
cells within a local network? We argue that in the cortex, where firing
exhibits extensive correlations in both time and space and where a typical
sample of neurons still reflects only a very small fraction of the local
population, the most appropriate model captures shared variability by a
low-dimensional latent process evolving with smooth dynamics, rather than by
putative direct coupling. We test this claim by comparing a latent dynamical
model with realistic spiking observations to coupled generalised linear
spike-response models (GLMs) using cortical recordings. We find that the
latent dynamical approach outperforms the GLM in terms of goodness-of-fit,
and reproduces the temporal correlations in the data more accurately. We also
compare models whose observation models are derived from either Gaussian
or point-process models, finding that the non-Gaussian model provides
slightly better goodness-of-fit and more realistic population spike
counts.

B. Petreska, B. M. Yu, J. P. Cunningham, G. Santhanam, S. I. Ryu, K. V. Shenoy,
and M. Sahani.
**Dynamical
segmentation of single trials from population neural data**.
In *Advances in Neural Information Processing Systems 25*, pages 1-8,
Granada, Spain, December 2011.

** Abstract:** Simultaneous
recordings of many neurons embedded within a recurrently-connected cortical
network may provide concurrent views into the dynamical processes of that
network, and thus its computational function. In principle, these dynamics
might be identified by purely unsupervised, statistical means. Here, we show
that a Hidden Switching Linear Dynamical Systems (HSLDS) model - in which
multiple linear dynamical laws approximate a nonlinear and potentially
non-stationary dynamical process - is able to distinguish dynamical regimes
within single-trial motor cortical activity associated with the preparation
and initiation of hand movements. The regimes are identified without
reference to behavioural or experimental epochs, but nonetheless transitions
between them correlate strongly with external events whose timing may vary
from trial to trial. The HSLDS model also performs better than recent
comparable models in predicting the firing rate of an isolated neuron based
on the firing rates of others, suggesting that it captures more of the
"shared variance" of the data. Thus, the method is able to trace the
dynamical processes underlying the coordinated evolution of network activity
in a way that appears to reflect its computational role.

J. P. Cunningham, P. Nuyujukian, V. Gilja, C. A. Chestek, S. I. Ryu, and K. V.
Shenoy.
**A
closed-loop human simulator for investigating the role of feedback-control in
brain-machine interfaces**.
*Journal of Neurophysiology*, 105:1932-1949, 2011.

** Abstract:** Neural prosthetic systems seek to improve the lives of severely
disabled people by decoding neural activity into useful behavioral commands.
These systems and their decoding algorithms are typically developed
"offline", using neural activity previously gathered from a healthy animal,
and the decoded movement is then compared with the true movement that
accompanied the recorded neural activity. However, this offline design and
testing may neglect important features of a real prosthesis, most notably the
critical role of feedback control, which enables the user to adjust neural
activity while using the prosthesis. We hypothesize that understanding and
optimally designing high-performance decoders require an experimental
platform where humans are in closed-loop with the various candidate decode
systems and algorithms. The extent to which the subject can, for a particular
decode system, algorithm, or parameter, engage feedback and other strategies
to improve decode performance remains unexplored. Closed-loop testing may
suggest different choices than offline analyses. Here we ask if a healthy
human subject, using a closed-loop neural prosthesis driven by synthetic
neural activity, can inform system design. We use this online prosthesis
simulator (OPS) to optimize "online" decode performance based on a key
parameter of a current state-of-the-art decode algorithm, the bin width of a
Kalman filter. First, we show that offline and online analyses indeed suggest
different parameter choices. Previous literature and our offline analyses
agree that neural activity should be analyzed in bins of 100- to 300-ms
width. OPS analysis, which incorporates feedback control, suggests that much
shorter bin widths (25-50 ms) yield higher decode performance. Second, we
confirm this surprising finding using a closed-loop rhesus monkey prosthetic
system. These findings illustrate the type of discovery made possible by the
OPS, and so we hypothesize that this novel testing approach will help in the
design of prosthetic systems that will translate well to human patients.

M. Zhao, A. P. Batista, J. P. Cunningham, C. A. Chestek, Z. Rivera-Alvidrez,
R. Kalmar, S. I. Ryu, K. V. Shenoy, and S. Iyengar.
**An
L1-regularized logistic model for detecting short-term neuronal
interactions**.
*Journal of Computational Neuroscience*, 2011, doi
10.1007/s10827-011-0365-5.
In Press.

** Abstract:** Interactions among neurons are a key
component of neural signal processing. Rich neural data sets potentially
containing evidence of interactions can now be collected readily in the
laboratory, but existing analysis methods are often not sufficiently
sensitive and specific to reveal these interactions. Generalized linear
models offer a platform for analyzing multi-electrode recordings of neuronal
spike train data. Here we suggest an L1-regularized logistic regression model
(L1L method) to detect short-term (order of 3 ms) neuronal interactions. We
estimate the parameters in this model using a coordinate descent algorithm,
and determine the optimal tuning parameter using a Bayesian Information
Criterion. Simulation studies show that in general the L1L method has better
sensitivities and specificities than those of the traditional
shuffle-corrected cross-correlogram (covariogram) method. The L1L method is
able to detect excitatory interactions with both high sensitivity and
specificity with reasonably large recordings, even when the magnitude of the
interactions is small; similar results hold for inhibition given sufficiently
high baseline firing rates. Our study also suggests that the false positives
can be further removed by thresholding, because their magnitudes are
typically smaller than true interactions. Simulations also show that the L1L
method is somewhat robust to partially observed networks. We apply the method
to multi-electrode recordings collected in the monkey dorsal premotor cortex
(PMd) while the animal prepares to make reaching arm movements. The results
show that some neurons interact differently depending on task conditions. The
stronger interactions detected with our L1L method were also visible using
the covariogram method.

M. M. Churchland, J. P. Cunningham, M. T. Kaufman, S. I. Ryu, and K. V. Shenoy.
**Cortical
preparatory activity: Representation of movement or first cog in a dynamical
machine?**.
*Neuron*, 68:387-400, 2010.

** Abstract:** The motor
cortices are active during both movement and movement preparation. A common
assumption is that preparatory activity constitutes a subthreshold form of
movement activity: a neuron active during rightward movements becomes
modestly active during preparation of a rightward movement. We asked whether
this pattern of activity is, in fact, observed. We found that it was not: at
the level of a single neuron, preparatory tuning was weakly correlated with
movement-period tuning. Yet, somewhat paradoxically, preparatory tuning could
be captured by a preferred direction in an abstract "space" that described
the population-level pattern of movement activity. In fact, this relationship
accounted for preparatory responses better than did traditional tuning
models. These results are expected if preparatory activity provides the
initial state of a dynamical system whose evolution produces movement
activity. Our results thus suggest that preparatory activity may not
represent specific factors, and may instead play a more mechanistic role.

M. M. Churchland, B. M. Yu, J. P. Cunningham, L. P. Sugrue, M. R. Cohen, G. S.
Corrado, W. T. Newsome, A. M. Clark, P. Hosseini, B. B. Scott, D. C. Bradley,
M. A. Smith, A. Kohn, J. A. Movshon, K. M. Armstrong, T. Moore, S. W. Chang,
L. H. Snyder, S. G. Lisberger, N. J. Priebe, I. M. Finn, D. Ferster, S. I.
Ryu, G. Santhanam, M. Sahani, and K. V. Shenoy.
**Stimulus
onset quashes neural variability: a widespread cortical phenomenon**.
*Nature Neuroscience*, 13:369-378, 2010.

** Abstract:**
Neural responses are typically characterized by computing the mean firing
rate, but response variability can exist across trials. Many studies have
examined the effect of a stimulus on the mean response, but few have examined
the effect on response variability. We measured neural variability in 13
extracellularly recorded datasets and one intracellularly recorded dataset
from seven areas spanning the four cortical lobes in monkeys and cats. In
every case, stimulus onset caused a decline in neural variability. This
occurred even when the stimulus produced little change in mean firing rate.
The variability decline was observed in membrane potential recordings, in the
spiking of individual neurons and in correlated spiking variability measured
with implanted 96-electrode arrays. The variability decline was observed for
all stimuli tested, regardless of whether the animal was awake, behaving or
anaesthetized. This widespread variability decline suggests a rather general
property of cortex, that its state is stabilized by an input.

B. M. Yu, J. P. Cunningham, G. Santhanam, S. I. Ryu, K. V. Shenoy, and
M. Sahani.
**Gaussian-process
factor analysis for low-dimensional single-trial analysis of neural
population activity**.
In *Advances in Neural Information Processing Systems 21*, pages 1-8,
Vancouver, BC, December 2009.

** Abstract:** We consider the
problem of extracting smooth, low-dimensional neural trajectories that
summarize the activity recorded simultaneously from many neurons on
individual experimental trials. Beyond the benefit of visualizing the
high-dimensional, noisy spiking activity in a compact form, such trajectories
can offer insight into the dynamics of the neural circuitry underlying the
recorded activity. Current methods for extracting neural trajectories involve
a two-stage process: the spike trains are first smoothed over time, then a
static dimensionality-reduction technique is applied. We first describe
extensions of the two-stage methods that allow the degree of smoothing to be
chosen in a principled way and that account for spiking variability, which
may vary both across neurons and across time. We then present a novel method
for extracting neural trajectories - Gaussian-process factor analysis (GPFA)
- which unifies the smoothing and dimensionality-reduction operations in a
common probabilistic framework. We applied these methods to the activity of
61 neurons recorded simultaneously in macaque premotor and motor cortices
during reach planning and execution. By adopting a goodness-of-fit metric
that measures how well the activity of each neuron can be predicted by all
other recorded neurons, we found that the proposed extensions improved the
predictive ability of the two-stage methods. The predictive ability was
further improved by going to GPFA. From the extracted trajectories, we
directly observed a convergence in neural state during motor planning, an
effect that was shown indirectly by previous studies. We then show how such
methods can be a powerful tool for relating the spiking activity across a
neural population to the subject's behavior on a single-trial basis. Finally,
to assess how well the proposed methods characterize neural population
activity when the underlying time course is known, we performed simulations
that revealed that GPFA performed tens of percent better than the best
two-stage method.

C. Chang, J. P. Cunningham, and G. Glover.
**Influence
of heart rate on the BOLD signal: the cardiac response function**.
*NeuroImage*, 44:857-869, 2009.

** Abstract:** It has
previously been shown that low-frequency fluctuations in both respiratory
volume and cardiac rate can induce changes in the blood-oxygen level
dependent (BOLD) signal. Such physiological noise can obscure the detection
of neural activation using fMRI, and it is therefore important to model and
remove the effects of this noise. While a hemodynamic response function
relating respiratory variation (RV) and the BOLD signal has been described,
no such mapping for heart rate (HR) has been proposed. In the current study,
the effects of RV and HR are simultaneously deconvolved from resting state
fMRI. It is demonstrated that a convolution model including RV and HR can
explain significantly more variance in gray matter BOLD signal than a model
that includes RV alone, and an average HR response function is proposed that
well characterizes our subject population. It is observed that the voxel-wise
morphology of the deconvolved RV responses is preserved when HR is included
in the model, and that its form is adequately modeled by Birn et al.'s
previously described respiration response function. Furthermore, it is shown
that modeling out RV and HR can significantly alter functional connectivity
maps of the default-mode network.

B. M. Yu, J. P. Cunningham, G. Santhanam, S. I. Ryu, K. V. Shenoy, and
M. Sahani.
**Gaussian-process
factor analysis for low-dimensional single-trial analysis of neural
population activity**.
*Journal of Neurophysiology*, 102:614-635, 2009.

**
Abstract:** We consider the problem of extracting smooth, low-dimensional
neural trajectories that summarize the activity recorded simultaneously from
many neurons on individual experimental trials. Beyond the benefit of
visualizing the high-dimensional, noisy spiking activity in a compact form,
such trajectories can offer insight into the dynamics of the neural circuitry
underlying the recorded activity. Current methods for extracting neural
trajectories involve a two-stage process: the spike trains are first smoothed
over time, then a static dimensionality-reduction technique is applied. We
first describe extensions of the two-stage methods that allow the degree of
smoothing to be chosen in a principled way and that account for spiking
variability, which may vary both across neurons and across time. We then
present a novel method for extracting neural trajectories - Gaussian-process
factor analysis (GPFA) - which unifies the smoothing and
dimensionality-reduction operations in a common probabilistic framework. We applied these
methods to the activity of 61 neurons recorded simultaneously in macaque
premotor and motor cortices during reach planning and execution. By adopting
a goodness-of-fit metric that measures how well the activity of each neuron
can be predicted by all other recorded neurons, we found that the proposed
extensions improved the predictive ability of the two-stage methods. The
predictive ability was further improved by going to GPFA. From the extracted
trajectories, we directly observed a convergence in neural state during motor
planning, an effect that was shown indirectly by previous studies. We then
show how such methods can be a powerful tool for relating the spiking
activity across a neural population to the subject's behavior on a
single-trial basis. Finally, to assess how well the proposed methods
characterize neural population activity when the underlying time course is
known, we performed simulations that revealed that GPFA performed tens of
percent better than the best two-stage method.

J. P. Cunningham, B. M. Yu, K. V. Shenoy, and M. Sahani.
**Inferring
neural firing rates from spike trains using Gaussian processes**.
In *Advances in Neural Information Processing Systems 20*, pages 1-8,
Vancouver, BC, December 2008.

** Abstract:** Neural spike
trains present challenges to analytical efforts due to their noisy, spiking
nature. Many studies of neuroscientific and neural prosthetic importance rely
on a smoothed, denoised estimate of the spike train's underlying firing rate.
Current techniques to find time-varying firing rates require ad hoc choices
of parameters, offer no confidence intervals on their estimates, and can
obscure potentially important single trial variability. We present a new
method, based on a Gaussian Process prior, for inferring probabilistically
optimal estimates of firing rate functions underlying single or multiple
neural spike trains. We test the performance of the method on simulated data
and experimentally gathered neural spike trains, and we demonstrate
improvements over conventional estimators.

** Comment:** Spotlight Presentation

J. P. Cunningham, K. V. Shenoy, and M. Sahani.
**Fast
Gaussian process methods for point process intensity estimation**.
In *25th International Conference on Machine Learning*, pages 1-8,
Helsinki, Finland, June 2008.

** Abstract:** Point processes
are difficult to analyze because they provide only a sparse and noisy
observation of the intensity function driving the process. Gaussian Processes
offer an attractive framework within which to infer underlying intensity
functions. The result of this inference is a continuous function defined
across time that is typically more amenable to analytical efforts. However, a
naive implementation will become computationally infeasible in any problem of
reasonable size, both in memory and run time requirements. We demonstrate
problem specific methods for a class of renewal processes that eliminate the
memory burden and reduce the solve time by orders of magnitude.

J. P. Cunningham.
**Derivation
of Expectation Propagation for "fast Gaussian process methods for point
process intensity estimation"**.
Technical report, Stanford University, 2008.

** Abstract:** We
derive the Expectation Propagation algorithm updates for approximating the
posterior distribution on intensity in a conditionally inhomogeneous gamma
interval process with a Gaussian Process prior (GP IGIP), a model which
appeared in Cunningham, Shenoy, Sahani (2008) ICML.

Fergus Simpson, Ian Davies, Vidhi Lalchand, Alessandro Vullo, Nicolas Durrande,
and Carl Edward Rasmussen.
**Kernel
identification through transformers**.
In *Advances in Neural Information Processing Systems 34*,
volume 34, pages 10483-10495, 2021.

** Abstract:**
Kernel selection plays a central role in determining the performance of
Gaussian Process (GP) models, as the chosen kernel determines both the
inductive biases and prior support of functions under the GP prior. This work
addresses the challenge of constructing custom kernel functions for
high-dimensional GP regression models. Drawing inspiration from recent
progress in deep learning, we introduce a novel approach named KITT: Kernel
Identification Through Transformers. KITT exploits a transformer-based
architecture to generate kernel recommendations in under 0.1 seconds, which
is several orders of magnitude faster than conventional kernel search
algorithms. We train our model using synthetic data generated from priors
over a vocabulary of known kernels. By exploiting the nature of the
self-attention mechanism, KITT is able to process datasets with inputs of
arbitrary dimension. We demonstrate that kernels chosen by KITT yield strong
performance over a diverse collection of regression benchmarks.

Alex Davies.
**Effective
implementation of Gaussian process regression for machine learning**.
PhD thesis, University of Cambridge, Department of Engineering, Cambridge, UK,
2015.

** Abstract:** This thesis presents frameworks for the
effective implementation of Gaussian process regression for machine learning.
It addresses this in three parts: effective iterative methods for calculating
the predictive distribution and derivatives of a Gaussian process with fixed
hyper-parameters, defining three broad classes of kernels of controllable
complexity that allow for an order of magnitude scaling in the previous
framework and an investigation into alternative objective functions and
improved derivatives for the optimization of model hyper-parameters.

Alex Davies and Zoubin Ghahramani.
**The random forest
kernel and other kernels for big data from random partitions**.
*arXiv*, abs/1402.4293, 2014.

** Abstract:** We present
Random Partition Kernels, a new class of kernels derived by demonstrating a
natural connection between random partitions of objects and kernels between
those objects. We show how the construction can be used to create kernels
from methods that would not normally be viewed as random partitions, such as
Random Forest. To demonstrate the potential of this method, we propose two
new kernels, the Random Forest Kernel and the Fast Cluster Kernel, and show
that these kernels consistently outperform standard kernels on problems
involving real-world datasets. Finally, we show how the form of these kernels
lend themselves to a natural approximation that is appropriate for certain
big data problems, allowing O(N) inference in methods such as Gaussian
Processes, Support Vector Machines and Kernel PCA.

Simon Lacoste-Julien, Konstantina Palla, Alex Davies, Gjergji Kasneci, Thore
Graepel, and Zoubin Ghahramani.
**Sigma: simple
greedy matching for aligning large knowledge bases**.
In *KDD*, pages 572-580. Association for Computing Machinery, 2013.

** Abstract:** The Internet has enabled the creation of a
growing number of large-scale knowledge bases in a variety of domains
containing complementary information. Tools for automatically aligning these
knowledge bases would make it possible to unify many sources of structured
knowledge and answer complex queries. However, the efficient alignment of
large-scale knowledge bases still poses a considerable challenge. Here, we
present Simple Greedy Matching (SiGMa), a simple algorithm for aligning
knowledge bases with millions of entities and facts. SiGMa is an iterative
propagation algorithm which leverages both the structural information from
the relationship graph as well as flexible similarity measures between entity
properties in a greedy local search, thus making it scalable. Despite its
greedy nature, our experiments indicate that SiGMa can efficiently match some
of the world's largest knowledge bases with high precision. We provide
additional experiments on benchmark datasets which demonstrate that SiGMa can
outperform state-of-the-art approaches both in accuracy and efficiency.

A. Davies and Z. Ghahramani.
**Language-independent
Bayesian sentiment mining of twitter**.
In *The Fifth Workshop on Social Network Mining and Analysis
(SNA-KDD 2011)*, August 2011.

** Abstract:** This paper
outlines a new language-independent model for sentiment analysis of short,
social-network statuses. We demonstrate this on data from Twitter, modelling
happy vs sad sentiment, and show that in some circumstances this outperforms
similar Naive Bayes models by more than 10%. We also propose an extension to
allow the modelling of different sentiment distributions in different
geographic regions, while incorporating information from neighbouring
regions. We outline the considerations when creating a system analysing
Twitter data and present a scalable system of data acquisition and
prediction that can monitor the sentiment of tweets in real time.

Javier Antorán, David Janz, James Urquhart Allingham, Erik A. Daxberger,
Riccardo Barbano, Eric T. Nalisnick, and José Miguel Hernández-Lobato.
**Adapting the
linearised Laplace model evidence for modern deep learning**.
In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang
Niu, and Sivan Sabato, editors, *39th International Conference on Machine
Learning*, volume 162 of *Proceedings of Machine Learning
Research*, pages 796-821. PMLR, 2022.

** Abstract:**
The linearised Laplace method for estimating model uncertainty has received
renewed attention in the Bayesian deep learning community. The method
provides reliable error bars and admits a closed-form expression for the
model evidence, allowing for scalable selection of model hyperparameters. In
this work, we examine the assumptions behind this method, particularly in
conjunction with model selection. We show that these interact poorly with
some now-standard tools of deep learning (stochastic approximation methods
and normalisation layers) and make recommendations for how to better adapt
this classic method to the modern setting. We provide theoretical support for
our recommendations and validate them empirically on MLPs, classic CNNs,
residual networks with and without normalisation layers, generative
autoencoders and transformers.

Erik A. Daxberger, Eric T. Nalisnick, James Urquhart Allingham, Javier
Antorán, and José Miguel Hernández-Lobato.
**Bayesian deep
learning via subnetwork inference**.
In Marina Meila and Tong Zhang, editors, *38th International Conference on
Machine Learning*, volume 139 of *Proceedings of Machine Learning
Research*, pages 2510-2521. PMLR, 2021.

** Abstract:**
The Bayesian paradigm has the potential to solve core issues of deep neural
networks such as poor calibration and data inefficiency. Alas, scaling
Bayesian inference to large weight spaces often requires restrictive
approximations. In this work, we show that it suffices to perform inference
over a small subset of model weights in order to obtain accurate predictive
posteriors. The other weights are kept as point estimates. This subnetwork
inference framework enables us to use expressive, otherwise intractable,
posterior approximations over such subsets. In particular, we implement
subnetwork linearized Laplace: We first obtain a MAP estimate of all weights
and then infer a full-covariance Gaussian posterior over a subnetwork. We
propose a subnetwork selection strategy that aims to maximally preserve the
model's predictive uncertainty. Empirically, our approach is effective
compared to ensembles and less expressive posterior approximations over full
networks.

Vincent Dutordoir, Hugh Salimbeni, Marc Deisenroth, and James Hensman.
**Gaussian
process conditional density estimation**.
In *Advances in Neural Information Processing Systems 31*, Montréal,
Canada, Dec 2018.

** Abstract:** Conditional Density
Estimation (CDE) models deal with estimating conditional distributions. The
conditions imposed on the distribution are the inputs of the model. CDE is a
challenging task as there is a fundamental trade-off between model
complexity, representational capacity and overfitting. In this work, we
propose to extend the model's input with latent variables and use Gaussian
processes (GP) to map this augmented input onto samples from the conditional
distribution. Our Bayesian approach allows for the modeling of small
datasets, but we also provide the machinery for it to be applied to big data
using stochastic variational inference. Our approach can be used to model
densities even in sparse data regions, and allows for sharing learned
structure between conditions. We illustrate the effectiveness and
wide-reaching applicability of our model on a variety of real-world
problems, such as spatio-temporal density estimation of taxi drop-offs,
non-Gaussian noise modeling, and few-shot learning on omniglot images.

Roberto Calandra, Jan Peters, Carl Edward Rasmussen, and Marc Peter Deisenroth.
**Manifold
Gaussian processes for regression**.
In *International Joint Conference on Neural Networks*, 2016.

** Abstract:** Off-the-shelf Gaussian Process (GP) covariance
functions encode smoothness assumptions on the structure of the function to
be modeled. To model complex and nondifferentiable functions, these
smoothness assumptions are often too restrictive. One way to alleviate this
limitation is to find a different representation of the data by introducing a
feature space. This feature space is often learned in an unsupervised way,
which might lead to data representations that are not useful for the overall
regression task. In this paper, we propose Manifold Gaussian Processes, a
novel supervised method that jointly learns a transformation of the data into
a feature space and a GP regression from the feature space to observed space.
The Manifold GP is a full GP and allows to learn data representations, which
are useful for the overall regression task. As a proof-of-concept, we
evaluate our approach on complex non-smooth functions where standard GPs
perform poorly, such as step functions and robotics tasks with contacts.

Marc Peter Deisenroth, Dieter Fox, and Carl Edward Rasmussen.
**Gaussian processes for data-efficient learning in robotics and control**.
*IEEE Transactions on Pattern Analysis and Machine Intelligence*,
37:408-423, 2015, doi
10.1109/TPAMI.2013.218.

** Abstract:** Autonomous learning
has been a promising direction in control and robotics for more than a decade
since data-driven learning allows to reduce the amount of engineering
knowledge, which is otherwise required. However, autonomous reinforcement
learning (RL) approaches typically require many interactions with the system
to learn controllers, which is a practical limitation in real systems, such
as robots, where many interactions can be impractical and time consuming. To
address this problem, current learning approaches typically require
task-specific knowledge in form of expert demonstrations, realistic
simulators, pre-shaped policies, or specific knowledge about the underlying
dynamics. In this article, we follow a different approach and speed up
learning by extracting more information from data. In particular, we learn a
probabilistic, non-parametric Gaussian process transition model of the
system. By explicitly incorporating model uncertainty into long-term planning
and controller learning our approach reduces the effects of model errors, a
key problem in model-based learning. Compared to state-of-the-art RL our
model-based policy search method achieves an unprecedented speed of learning.
We demonstrate its applicability to autonomous learning in real robot and
control tasks.

B. Bischoff, D. Nguyen-Tuong, H. van Hoof, A. McHutchon, Carl Edward Rasmussen,
A. Knoll, and M. P. Deisenroth.
**Policy search
for learning robot control using sparse data**.
In *IEEE International Conference on Robotics and Automation*, pages
3882-3887, Hong Kong, China, 2014. IEEE, doi
10.1109/ICRA.2014.6907422.

** Abstract:** In many complex
robot applications, such as grasping and manipulation, it is difficult to
program desired task solutions beforehand, as robots are within an uncertain
and dynamic environment. In such cases, learning tasks from experience can be
a useful alternative. To obtain a sound learning and generalization
performance, machine learning, especially, reinforcement learning, usually
requires sufficient data. However, in cases where only little data is
available for learning, due to system constraints and practical issues,
reinforcement learning can act suboptimally. In this paper, we investigate
how model-based reinforcement learning, in particular the probabilistic
inference for learning control method (PILCO), can be tailored to cope with
the case of sparse data to speed up learning. The basic idea is to include
further prior knowledge into the learning process. As PILCO is built on the
probabilistic Gaussian processes framework, additional system knowledge can
be incorporated by defining appropriate prior distributions, e.g. a linear
mean Gaussian prior. The resulting PILCO formulation remains in closed form
and analytically tractable. The proposed approach is evaluated in simulation
as well as on a physical robot, the Festo Robotino XT. For the robot
evaluation, we employ the approach for learning an object pick-up task. The
results show that by including prior knowledge, policy learning can be sped
up in presence of sparse data.

Marc Peter Deisenroth, Ryan D. Turner, Marco F. Huber, Uwe D. Hanebeck, and
Carl Edward Rasmussen.
**Robust
filtering and smoothing with Gaussian processes**.
*IEEE Transactions on Automatic Control*, 57(7):1865-1871, 2012, doi
10.1109/TAC.2011.2179426.

** Abstract:** We propose a
principled algorithm for robust Bayesian filtering and smoothing in nonlinear
stochastic dynamic systems when both the transition function and the
measurement function are described by nonparametric Gaussian process (GP)
models. GPs are gaining increasing importance in signal processing, machine
learning, robotics, and control for representing unknown system functions by
posterior probability distributions. This modern way of "system
identification" is more robust than finding point estimates of a parametric
function representation. Our principled filtering/smoothing approach for GP
dynamic systems is based on analytic moment matching in the context of the
forward-backward algorithm. Our numerical evaluations demonstrate the
robustness of the proposed approach in situations where other
state-of-the-art Gaussian filters and smoothers can fail.

Marc Peter Deisenroth, Carl Edward Rasmussen, and Dieter Fox.
**Learning to
control a low-cost manipulator using data-efficient reinforcement
learning**.
In *9th International Conference on Robotics: Science & Systems*, Los
Angeles, CA, USA, June 2011.

** Abstract:** Over the last
years, there has been substantial progress in robust manipulation in
unstructured environments. The long-term goal of our work is to get away from
precise, but very expensive robotic systems and to develop affordable,
potentially imprecise, self-adaptive manipulator systems that can
interactively perform tasks such as playing with children. In this paper, we
demonstrate how a low-cost off-the-shelf robotic system can learn closed-loop
policies for a stacking task in only a handful of trials - from scratch. Our
manipulator is inaccurate and provides no pose feedback. For learning a
controller in the work space of a Kinect-style depth camera, we use a
model-based reinforcement learning technique. Our learning method is data
efficient, reduces model bias, and deals with several noise sources in a
principled way during long-term planning. We present a way of incorporating
state-space constraints into the learning process and analyze the learning
gain by exploiting the sequential structure of the stacking task.

** Comment:** project site

Marc Peter Deisenroth and Carl Edward Rasmussen.
**PILCO: A
model-based and data-efficient approach to policy search**.
In *28th International Conference on Machine Learning*, 2011.

** Abstract:** In this paper, we introduce PILCO, a practical,
data-efficient model-based policy search method. PILCO reduces model bias,
one of the key problems of model-based reinforcement learning, in a
principled way. By learning a probabilistic dynamics model and explicitly
incorporating model uncertainty into long-term planning, PILCO can cope with
very little data and facilitates learning from scratch in only a few trials.
Policy evaluation is performed in closed form using state-of-the-art
approximate inference. Furthermore, policy gradients are computed
analytically for policy improvement. We report unprecedented learning
efficiency on challenging and high-dimensional control tasks.

** Comment:** web site

Ryan Turner, Marc Peter Deisenroth, and Carl Edward Rasmussen.
**State-space
inference and learning with Gaussian processes**.
In Yee Whye Teh and Mike Titterington, editors, *13th International
Conference on Artificial Intelligence and Statistics*, volume 9 of
*W & CP*, pages 868-875, Chia Laguna, Sardinia, Italy, May 13-15
2010. Journal of Machine Learning Research.

** Abstract:**
State-space inference and learning with Gaussian processes (GPs) is an
unsolved problem. We propose a new, general methodology for inference and
learning in nonlinear state-space models that are described probabilistically
by non-parametric GP models. We apply the expectation maximization algorithm
to iterate between inference in the latent state-space and learning the
parameters of the underlying GP dynamics model.

** Comment:** poster.

Marc Peter Deisenroth.
**Efficient reinforcement
learning using Gaussian processes**.
PhD thesis, Karlsruhe Institute of Technology, Karlsruhe, Germany, 2010.

** Abstract:** In many research areas, including control and
medical applications, we face decision-making problems where data are limited
and/or the underlying generative process is complicated and partially
unknown. In these scenarios, we can profit from algorithms that learn from
data and aid decision making.

Reinforcement learning (RL) is a general
computational approach to experience-based goal-directed learning for
sequential decision making under uncertainty. However, RL often lacks
efficiency in terms of the number of required trials when no task-specific
knowledge is available. This lack of efficiency makes RL often inapplicable
to (optimal) control problems. Thus, a central issue in RL is to speed up
learning by extracting more information from available experience.

The
contributions of this dissertation are threefold:

1. We propose PILCO, a
fully Bayesian approach for efficient RL in continuous-valued state and
action spaces when no expert knowledge is available. PILCO is based on
well-established ideas from statistics and machine learning. PILCO's key
ingredient is a probabilistic dynamics model learned from data, which is
implemented by a Gaussian process (GP). The GP carefully quantifies knowledge
by a probability distribution over plausible dynamics models. By averaging
over all these models during long-term planning and decision making, PILCO
takes uncertainties into account in a principled way and, therefore, reduces
model bias, a central problem in model-based RL.

2. Due to its generality
and efficiency, PILCO can be considered a conceptual and practical approach
to jointly learning models and controllers when expert knowledge is difficult
to obtain or simply not available. For this scenario, we investigate PILCO's
properties and its applicability to challenging real and simulated nonlinear
control problems. For example, we consider the tasks of learning to swing up
a double pendulum attached to a cart or to balance a unicycle with five
degrees of freedom. Across all tasks we report unprecedented automation and
an unprecedented learning efficiency for solving these tasks.

3. As a
step toward PILCO's extension to partially observable Markov decision
processes, we propose a principled algorithm for robust filtering and
smoothing in GP dynamic systems. Unlike commonly used Gaussian filters for
nonlinear systems, it relies neither on function linearization nor on
finite-sample representations of densities. Our algorithm profits from exact
moment matching for predictions while keeping all computations analytically
tractable. We present experimental evidence that demonstrates the robustness
and the advantages of our method over unscented Kalman filters, the cubature
Kalman filter, and the extended Kalman filter.

Ryan Turner, Marc Peter Deisenroth, and Carl Edward Rasmussen.
**System
identification in Gaussian process dynamical systems**.
In Dilan Görür, editor, *NIPS Workshop on Nonparametric Bayes*,
Whistler, BC, Canada, December 2009.

** Comment:** poster.

Marc Peter Deisenroth and Carl Edward Rasmussen.
**Efficient
reinforcement learning for motor control**.
In *10th International PhD Workshop on Systems and Control*, Hluboká
nad Vltavou, Czech Republic, September 2009.

** Abstract:**
Artificial learners often require many more trials than humans or animals
when learning motor control tasks in the absence of expert knowledge. We
implement two key ingredients of biological learning systems, generalization
and incorporation of uncertainty into the decision-making process, to speed
up artificial learning. We present a coherent and fully Bayesian framework
that allows for efficient artificial learning in the absence of expert
knowledge. The success of our learning framework is demonstrated on
challenging nonlinear control problems in simulation and in hardware.

Marc Peter Deisenroth, Marco F. Huber, and Uwe D. Hanebeck.
**Analytic
moment-based Gaussian process filtering**.
In Léon Bottou and Michael Littman, editors, *26th International
Conference on Machine Learning*, pages 225-232, Montréal, QC,
Canada, June 2009. Omnipress.

** Abstract:** We propose an
analytic moment-based filter for nonlinear stochastic dynamic systems modeled
by Gaussian processes. Exact expressions for the expected value and the
covariance matrix are provided for both the prediction step and the filter
step, where an additional Gaussian assumption is exploited in the latter
case. Our filter does not require further approximations. In particular, it
avoids finite-sample approximations. We compare the filter to a variety of
Gaussian filters, that is, the EKF, the UKF, and the recent GP-UKF proposed
by Ko et al. (2007).

** Comment:** With corrections. code.

Marc Peter Deisenroth and Carl Edward Rasmussen.
**Bayesian inference
for efficient learning in control**.
In *Multidisciplinary Symposium on Reinforcement Learning*,
Montréal, QC, Canada, June 2009.

** Abstract:** In
contrast to humans or animals, artificial learners often require more trials
when learning motor control tasks solely based on experience. Efficient
autonomous learners will reduce the amount of engineering required to solve
control problems. By using probabilistic forward models, we can employ two
key ingredients of biological learning systems to speed up artificial
learning. We present a consistent and coherent Bayesian framework that allows
for efficient autonomous experience-based learning. We demonstrate the
success of our learning algorithm by applying it to challenging nonlinear
control problems in simulation and in hardware.

Marc Peter Deisenroth, Carl Edward Rasmussen, and Jan Peters.
**Gaussian process
dynamic programming**.
*Neurocomputing*, 72(7-9):1508-1524, March 2009, doi
10.1016/j.neucom.2008.12.019.

** Abstract:** Reinforcement
learning (RL) and optimal control of systems with continuous states and
actions require approximation techniques in most interesting cases. In this
article, we introduce Gaussian process dynamic programming (GPDP), an
approximate value function-based RL algorithm. We consider both a classic
optimal control problem, where problem-specific prior knowledge is available,
and a classic RL problem, where only very general priors can be used. For the
classic optimal control problem, GPDP models the unknown value functions with
Gaussian processes and generalizes dynamic programming to continuous-valued
states and actions. For the RL problem, GPDP starts from a given initial
state and explores the state space using Bayesian active learning. To design
a fast learner, available data have to be used efficiently. Hence, we propose
to learn probabilistic models of the a priori unknown transition dynamics and
the value functions on the fly. In both cases, we successfully apply the
resulting continuous-valued controllers to the under-actuated pendulum swing
up and analyze the performances of the suggested algorithms. It turns out
that GPDP uses data very efficiently and can be applied to problems, where
classic dynamic programming would be cumbersome.

** Comment:** code.

Carl Edward Rasmussen and Marc Peter Deisenroth.
**Probabilistic
inference for fast learning in control**.
In S. Girgin, M. Loth, R. Munos, P. Preux, and D. Ryabko, editors, *Recent
Advances in Reinforcement Learning*, volume 5323 of *Lecture Notes in
Computer Science (LNCS)*, pages 229-242, Villeneuve d'Ascq, France,
November 2008. Springer-Verlag.

** Abstract:** We provide a
novel framework for very fast model-based reinforcement learning in
continuous state and action spaces. The framework requires probabilistic
models that explicitly characterize their levels of confidence. Within this
framework, we use flexible, non-parametric models to describe the world based
on previously collected experience. We demonstrate learning on the cart-pole
problem in a setting where we provide very limited prior knowledge about the
task. Learning progresses rapidly, and a good policy is found after only a
handful of iterations.

** Comment:** videos and more. slides.

Marc Peter Deisenroth, Jan Peters, and Carl Edward Rasmussen.
**Approximate
dynamic programming with Gaussian processes**.
In *2008 American Control Conference (ACC 2008)*, pages 4480-4485,
Seattle, WA, USA, June 2008.

** Abstract:** In general, it is
difficult to determine an optimal closed-loop policy in nonlinear control
problems with continuous-valued state and control domains. Hence,
approximations are often inevitable. The standard method of discretizing
states and controls suffers from the curse of dimensionality and strongly
depends on the chosen temporal sampling rate. The paper introduces Gaussian
Process Dynamic Programming (GPDP). In GPDP, value functions in the Bellman
recursion of the dynamic programming algorithm are modeled using Gaussian
processes. GPDP returns an optimal state-feedback for a finite set of states.
Based on these outcomes, we learn a possibly discontinuous closed-loop policy
on the entire state space by switching between two independently trained
Gaussian processes.

** Comment:** code.

Marc Peter Deisenroth, Carl Edward Rasmussen, and Jan Peters.
**Model-based
reinforcement learning with continuous states and actions**.
In *Proceedings of the 16th European Symposium on Artificial Neural Networks
(ESANN 2008)*, pages 19-24, Bruges, Belgium, April 2008.

** Abstract:** Finding an optimal policy in a reinforcement learning (RL)
framework with continuous state and action spaces is challenging. Approximate
solutions are often inevitable. GPDP is an approximate dynamic programming
algorithm based on Gaussian process (GP) models for the value functions. In
this paper, we extend GPDP to the case of unknown transition dynamics. After
building a GP model for the transition dynamics, we apply GPDP to this model
and determine a continuous-valued policy in the entire state space. We apply
the resulting controller to the underpowered pendulum swing up. Moreover, we
compare our results on this RL task to a nearly optimal discrete DP solution
in a fully known environment.

Beau Coker, Wessel P. Bruinsma, David R. Burt, Weiwei Pan, and Finale
Doshi-Velez.
**Wide
mean-field Bayesian neural networks ignore the data**.
In *25th International Conference on Artificial Intelligence and
Statistics*, 2022.

** Abstract:** Bayesian neural networks
(BNNs) combine the expressive power of deep learning with the advantages of
Bayesian formalism. In recent years, the analysis of wide, deep BNNs has
provided theoretical insight into their priors and posteriors. However, we
have no analogous insight into their posteriors under approximate inference.
In this work, we show that mean-field variational inference *entirely
fails to model the data* when the network width is large and the
activation function is odd. Specifically, for fully-connected BNNs with odd
activation functions and a homoscedastic Gaussian likelihood, we show that
the *optimal* mean-field variational posterior predictive (i.e.,
function space) distribution converges to the prior predictive distribution
as the width tends to infinity. We generalize aspects of this result to other
likelihoods. Our theoretical results are suggestive of underfitting behavior
previously observed in BNNs. While our convergence bounds are
non-asymptotic and constants in our analysis can be computed, they are
currently too loose to be applicable in standard training regimes. Finally,
we show that the optimal approximate posterior need not tend to the prior if
the activation function is not odd, showing that our statements cannot be
generalized arbitrarily.

Finale Doshi-Velez and Zoubin Ghahramani.
**A comparison of
human and agent reinforcement learning in partially observable
domains**.
In *33rd Annual Meeting of the Cognitive Science Society*, Boston, MA,
2011.

** Abstract:** It is commonly stated that reinforcement
learning (RL) algorithms learn slower than humans. In this work, we
investigate this claim using two standard problems from the RL literature. We
compare the performance of human subjects to RL techniques. We find that
context, the meaningfulness of the observations, plays a significant role
in the rate of human RL. Moreover, without contextual information, humans
often fare much worse than classic algorithms. Comparing the detailed
responses of humans and RL algorithms, we also find that humans appear to
employ rather different strategies from standard algorithms, even in cases
where they had indistinguishable performance to them. Our research both sheds
light on human RL and provides insights for improving RL algorithms.

Finale Doshi-Velez.
**The infinite partially
observable Markov decision process**.
In *Advances in Neural Information Processing Systems 23*, Cambridge,
MA, USA, December 2009. The MIT Press.

** Abstract:** The
Partially Observable Markov Decision Process (POMDP) framework has proven
useful in planning domains where agents must balance actions that provide
knowledge and actions that provide reward. Unfortunately, most POMDPs are
complex structures with a large number of parameters. In many real-world
problems, both the structure and the parameters are difficult to specify from
domain knowledge alone. Recent work in Bayesian reinforcement learning has
made headway in learning POMDP models; however, this work has largely focused
on learning the parameters of the POMDP model. We define an infinite POMDP
(iPOMDP) model that does not require knowledge of the size of the state
space; instead, it assumes that the number of visited states will grow as the
agent explores its world and only models visited states explicitly. We
demonstrate the iPOMDP on several standard problems.

Finale Doshi-Velez, David Knowles, Shakir Mohamed, and Zoubin Ghahramani.
**Large scale
non-parametric inference: Data parallelisation in the Indian buffet
process**.
In *Advances in Neural Information Processing Systems 23*, pages
1294-1302, Cambridge, MA, USA, December 2009. The MIT Press.

** Abstract:** Nonparametric Bayesian models provide a framework
for flexible probabilistic modelling of complex datasets. Unfortunately, the
high-dimensional averages required for Bayesian methods can be slow,
especially with the unbounded representations used by nonparametric models.
We address the challenge of scaling Bayesian inference to the increasingly
large datasets found in real-world applications. We focus on parallelisation
of inference in the Indian Buffet Process (IBP), which allows data points to
have an unbounded number of sparse latent features. Our novel MCMC sampler
divides a large data set between multiple processors and uses message passing
to compute the global likelihoods and posteriors. This algorithm, the first
parallel inference scheme for IBP-based models, scales to datasets orders of
magnitude larger than have previously been possible.

Finale Doshi-Velez.
**The Indian buffet
process: Scalable inference and extensions**.
Master's thesis, University of Cambridge, Cambridge, UK, August 2009.

** Abstract:** Many unsupervised learning problems seek to
identify hidden features from observations. In many real-world situations,
the number of hidden features is unknown. To avoid specifying the number of
hidden features a priori, one can use the Indian Buffet Process (IBP): a
nonparametric latent feature model that does not bound the number of active
features in a dataset. While elegant, the lack of efficient inference
procedures for the IBP has prevented its application in large-scale problems.
The core contributions of this thesis are three new inference procedures that
allow inference in the IBP to be scaled from a few hundred to 100,000
observations. This thesis contains three parts: (1) An introduction to the
IBP and a review of inference techniques and extensions. The first chapters
summarise three constructions for the IBP and review all currently published
inference techniques. Appendix C reviews extensions of the IBP to date. (2)
Novel techniques for scalable Bayesian inference. This thesis presents three
new inference procedures: (a) an accelerated Gibbs sampler for efficient
Bayesian inference in a broad class of conjugate models, (b) a parallel,
asynchronous Gibbs sampler that allows the accelerated Gibbs sampler to be
distributed across multiple processors, and (c) a variational inference
procedure for the IBP. (3) A framework for structured nonparametric latent
feature models. We also present extensions to the IBP to model more
sophisticated relationships between the co-occurring hidden features,
providing a general framework for correlated non-parametric feature
models.

F. Doshi-Velez and Z. Ghahramani.
**Correlated
non-parametric latent feature models**.
In *Conference on Uncertainty in Artificial Intelligence (UAI 2009)*,
pages 143-150, Montréal, QC, Canada, June 2009. AUAI Press.

** Abstract:** We are often interested in explaining data
through a set of hidden factors or features. To allow for an unknown number
of such hidden features, one can use the IBP: a non-parametric latent feature
model that does not bound the number of active features in a dataset.
However, the IBP assumes that all latent features are uncorrelated, making it
inadequate for many real-world problems. We introduce a framework for
correlated non-parametric feature models, generalising the IBP. We use this
framework to generate several specific models and demonstrate applications on
real-world datasets.

Finale Doshi-Velez and Zoubin Ghahramani.
**Accelerated Gibbs
sampling for the Indian buffet process**.
In Léon Bottou and Michael Littman, editors, *26th International
Conference on Machine Learning*, pages 273-280, Montréal, QC,
Canada, June 2009. Omnipress.

** Abstract:** We often seek to
identify co-occurring hidden features in a set of observations. The Indian
Buffet Process (IBP) provides a non-parametric prior on the features present
in each observation, but current inference techniques for the IBP often scale
poorly. The collapsed Gibbs sampler for the IBP has a running time cubic in
the number of observations, and the uncollapsed Gibbs sampler, while linear,
is often slow to mix. We present a new linear-time collapsed Gibbs sampler
for conjugate likelihood models and demonstrate its efficacy on large
real-world datasets.

F. Doshi-Velez, K.T. Miller, J. Van Gael, and Y.W. Teh.
**Variational
inference for the Indian buffet process**.
In *12th International Conference on Artificial Intelligence and
Statistics*, volume 12, pages 137-144, Clearwater Beach, FL, USA,
April 2009. Journal of Machine Learning Research.

** Abstract:** The Indian Buffet Process (IBP) is a nonparametric prior for
latent feature models in which observations are influenced by a combination
of hidden features. For example, images may be composed of several objects
and sounds may consist of several notes. Latent feature models seek to infer
these unobserved features from a set of observations; the IBP provides a
principled prior in situations where the number of hidden features is
unknown. Current inference methods for the IBP have all relied on sampling.
While these methods are guaranteed to be accurate in the limit, samplers for
the IBP tend to mix slowly in practice. We develop a deterministic
variational method for inference in the IBP based on a truncated
stick-breaking approximation, provide theoretical bounds on the truncation
error, and evaluate our method in several data regimes.

Finale Doshi-Velez, Kurt T. Miller, Jurgen Van Gael, and Yee Whye Teh.
**Variational
inference for the Indian buffet process**.
Technical Report CBL-2009-001, University of Cambridge, Computational and
Biological Learning Laboratory, Department of Engineering, April 2009.

** Abstract:** The Indian Buffet Process (IBP) is a
nonparametric prior for latent feature models in which observations are
influenced by a combination of hidden features. For example, images may be
composed of several objects and sounds may consist of several notes. Latent
feature models seek to infer these unobserved features from a set of
observations; the IBP provides a principled prior in situations where the
number of hidden features is unknown. Current inference methods for the IBP
have all relied on sampling. While these methods are guaranteed to be
accurate in the limit, samplers for the IBP tend to mix slowly in practice.
We develop a deterministic variational method for inference in the IBP based
on truncating to finite models, provide theoretical bounds on the
truncation error, and evaluate our method in several data regimes. This
technical report is a longer version of Doshi-Velez et al. (2009).

Finale Doshi-Velez and Zoubin Ghahramani.
**Accelerated
sampling for the Indian buffet process**.
In Andrea Pohoreckyj Danyluk, Léon Bottou, and Michael L. Littman, editors,
*ICML*, volume 382 of *ACM International Conference Proceeding
Series*, page 35. ACM, 2009.

** Abstract:** We often
seek to identify co-occurring hidden features in a set of observations. The
Indian Buffet Process (IBP) provides a nonparametric prior on the features
present in each observation, but current inference techniques for the IBP
often scale poorly. The collapsed Gibbs sampler for the IBP has a running
time cubic in the number of observations, and the uncollapsed Gibbs sampler,
while linear, is often slow to mix. We present a new linear-time collapsed
Gibbs sampler for conjugate likelihood models and demonstrate its efficacy on
large real-world datasets.

Vincent Dutordoir, Alan Saul, Zoubin Ghahramani, and Fergus Simpson.
**Neural diffusion
processes**.
In *arXiv*, Online, Apr 2022.

** Abstract:** Gaussian
processes provide an elegant framework for specifying prior and posterior
distributions over functions. They are, however, also computationally
expensive, and limited by the expressivity of their covariance function. We
propose Neural Diffusion Processes (NDPs), a novel approach based upon
diffusion models, that learn to sample from distributions over functions.
Using a novel attention block, we can incorporate properties of stochastic
processes, such as exchangeability, directly into the NDP's architecture. We
empirically show that NDPs are able to capture functional distributions that
are close to the true Bayesian posterior of a Gaussian process. This enables
a variety of downstream tasks, including hyperparameter marginalisation and
Bayesian optimisation.

Vincent Dutordoir, James Hensman, Mark van der Wilk, Carl Henrik Ek, Zoubin
Ghahramani, and Nicolas Durrande.
**Deep
neural networks as point estimates for deep Gaussian processes**.
In *Advances in Neural Information Processing Systems 34*, Online, Dec
2021.

** Abstract:** Neural networks and Gaussian processes
are complementary in their strengths and weaknesses. Having a better
understanding of their relationship comes with the promise to make each
method benefit from the strengths of the other. In this work, we establish an
equivalence between the forward passes of neural networks and (deep) sparse
Gaussian process models. The theory we develop is based on interpreting
activation functions as interdomain inducing features through a rigorous
analysis of the interplay between activation functions and kernels. This
results in models that can either be seen as neural networks with improved
uncertainty prediction or deep Gaussian processes with increased prediction
accuracy. These claims are supported by experimental results on regression
and classification datasets.

Vincent Dutordoir, Nicolas Durrande, and James Hensman.
**Sparse
Gaussian processes with spherical harmonic features**.
In *37th International Conference on Machine Learning*, Online, June
2020.

** Abstract:** We introduce a new class of inter-domain
variational Gaussian processes (GP) where data is mapped onto the unit
hypersphere in order to use spherical harmonic representations. Our inference
scheme is comparable to variational Fourier features, but it does not suffer
from the curse of dimensionality, and leads to diagonal covariance matrices
between inducing variables. This enables a speed-up in inference, because it
bypasses the need to invert large covariance matrices. Our experiments show
that our model is able to fit a regression model for a dataset with 6 million
entries two orders of magnitude faster compared to standard sparse GPs, while
retaining state of the art accuracy. We also demonstrate competitive
performance on classification with non-conjugate likelihoods.

Vincent Dutordoir, Hugh Salimbeni, Marc Deisenroth, and James Hensman.
**Gaussian
process conditional density estimation**.
In *Advances in Neural Information Processing Systems 32*, Montréal,
Canada, Dec 2018.

** Abstract:** Conditional Density
Estimation (CDE) models deal with estimating conditional distributions. The
conditions imposed on the distribution are the inputs of the model. CDE is a
challenging task as there is a fundamental trade-off between model
complexity, representational capacity and overfitting. In this work, we
propose to extend the model's input with latent variables and use Gaussian
processes (GP) to map this augmented input onto samples from the conditional
distribution. Our Bayesian approach allows for the modeling of small
datasets, but we also provide the machinery for it to be applied to big data
using stochastic variational inference. Our approach can be used to model
densities even in sparse data regions, and allows for sharing learned
structure between conditions. We illustrate the effectiveness and
wide-reaching applicability of our model on a variety of real-world
problems, such as spatio-temporal density estimation of taxi drop-offs,
non-Gaussian noise modeling, and few-shot learning on Omniglot images.

James Robert Lloyd, David Duvenaud, Roger Grosse, Joshua B. Tenenbaum, and
Zoubin Ghahramani.
**Automatic construction and
natural-language description of nonparametric regression models**.
In *Association for the Advancement of Artificial Intelligence (AAAI)*,
July 2014.

** Abstract:** This paper presents the beginnings
of an automatic statistician, focusing on regression problems. Our system
explores an open-ended space of statistical models to discover a good
explanation of a data set, and then produces a detailed report with figures
and natural-language text. Our approach treats unknown regression functions
nonparametrically using Gaussian processes, which has two important
consequences. First, Gaussian processes can model functions in terms of
high-level properties (e.g. smoothness, trends, periodicity, changepoints).
Taken together with the compositional structure of our language of models
this allows us to automatically describe functions in simple terms. Second,
the use of flexible nonparametric models and a rich language for composing
them in an open-ended manner also results in state-of-the-art extrapolation
performance evaluated over 13 real time series data sets from various
domains.

Michael Schober, David Duvenaud, and Philipp Hennig.
**Probabilistic ODE solvers with
Runge-Kutta means**.
*arXiv preprint arXiv:1406.2582*, June 2014.

** Abstract:** Runge-Kutta methods are the classic family of solvers for
ordinary differential equations (ODEs), and the basis for the
state-of-the-art. Like most numerical methods, they return point estimates.
We construct a family of probabilistic numerical methods that instead return
a Gauss-Markov process defining a probability distribution over the ODE
solution. In contrast to prior work, we construct this family such that
posterior means match the outputs of the Runge-Kutta family exactly, thus
inheriting their proven good properties. Remaining degrees of freedom not
identified by the match to Runge-Kutta are chosen such that the posterior
probability measure fits the observed structure of the ODE. Our results shed
light on the structure of Runge-Kutta solvers from a new direction, provide a
richer, probabilistic output, have low computational cost, and raise new
research questions.

David Duvenaud, Oren Rippel, Ryan P. Adams, and Zoubin Ghahramani.
**Avoiding pathologies in very
deep networks**.
In *17th International Conference on Artificial Intelligence and
Statistics*, Reykjavik, Iceland, April 2014.

** Abstract:** Choosing appropriate architectures and regularization
strategies for deep networks is crucial to good predictive performance. To
shed light on this problem, we analyze the analogous problem of constructing
useful priors on compositions of functions. Specifically, we study the deep
Gaussian process, a type of infinitely-wide, deep neural network. We show
that in standard architectures, the representational capacity of the network
tends to capture fewer degrees of freedom as the number of layers increases,
retaining only a single degree of freedom in the limit. We propose an
alternate network architecture which does not suffer from this pathology. We
also examine deep covariance functions, obtained by composing infinitely many
feature transforms. Lastly, we characterize the class of models obtained by
performing dropout on Gaussian processes.

Tomoharu Iwata, David Duvenaud, and Zoubin Ghahramani.
**Warped mixtures for nonparametric
cluster shapes**.
In *29th Conference on Uncertainty in Artificial Intelligence*,
Bellevue, Washington, July 2013.

** Abstract:** A mixture of
Gaussians fit to a single curved or heavy-tailed cluster will report that the
data contains many clusters. To produce more appropriate clusterings, we
introduce a model which warps a latent mixture of Gaussians to produce
nonparametric cluster shapes. The possibly low-dimensional latent mixture
model allows us to summarize the properties of the high-dimensional clusters
(or density manifolds) describing the data. The number of manifolds, as well
as the shape and dimension of each manifold is automatically inferred. We
derive a simple inference scheme for this model which analytically integrates
out both the mixture parameters and the warping function. We show that our
model is effective for density estimation, performs better than infinite
Gaussian mixture models at recovering the true number of clusters, and
produces interpretable summaries of high-dimensional datasets.

David Duvenaud, James Robert Lloyd, Roger Grosse, Joshua B. Tenenbaum, and
Zoubin Ghahramani.
**Structure
discovery in nonparametric regression through compositional kernel
search**.
In *30th International Conference on Machine Learning*, Atlanta,
Georgia, USA, June 2013.

** Abstract:** Despite its
importance, choosing the structural form of the kernel in nonparametric
regression remains a black art. We define a space of kernel structures which
are built compositionally by adding and multiplying a small number of base
kernels. We present a method for searching over this space of structures
which mirrors the scientific discovery process. The learned structures can
often decompose functions into interpretable components and enable long-range
extrapolation on time-series datasets. Our structure search method
outperforms many widely used kernels and kernel combination methods on a
variety of prediction tasks.

Michael A. Osborne, David Duvenaud, Roman Garnett, Carl Edward Rasmussen,
Stephen J. Roberts, and Zoubin Ghahramani.
**Active
learning of model evidence using Bayesian quadrature**.
In *Advances in Neural Information Processing Systems 25*, pages 46-54,
Lake Tahoe, California, USA, December 2012.

** Abstract:**
Numerical integration is a key component of many problems in scientific
computing, statistical modelling, and machine learning. Bayesian Quadrature
is a model-based method for numerical integration which, relative to standard
Monte Carlo methods, offers increased sample efficiency and a more robust
estimate of the uncertainty in the estimated integral. We propose a novel
Bayesian Quadrature approach for numerical integration when the integrand is
non-negative, such as the case of computing the marginal likelihood,
predictive distribution, or normalising constant of a probabilistic model.
Our approach approximately marginalises the quadrature model’s
hyperparameters in closed form, and introduces an active learning scheme to
optimally select function evaluations, as opposed to using Monte Carlo
samples. We demonstrate our method on both a number of synthetic benchmarks
and a real scientific problem from astronomy.

Ferenc Huszár and David Duvenaud.
**Optimally-weighted herding is
Bayesian quadrature**.
In *28th Conference on Uncertainty in Artificial Intelligence*, pages
377-385, Catalina Island, California, July 2012.

** Abstract:** Herding and kernel herding are deterministic methods of
choosing samples which summarise a probability distribution. A related task
is choosing samples for estimating integrals using Bayesian quadrature. We
show that the criterion minimised when selecting samples in kernel herding is
equivalent to the posterior variance in Bayesian quadrature. We then show
that sequential Bayesian quadrature can be viewed as a weighted version of
kernel herding which achieves performance superior to any other weighted
herding method. We demonstrate empirically a rate of convergence faster than
O(1/N). Our results also imply an upper bound on the empirical error of the
Bayesian quadrature estimate.

David Duvenaud, Hannes Nickisch, and Carl Edward Rasmussen.
**Additive
Gaussian processes**.
In *Advances in Neural Information Processing Systems 24*, pages
226-234, Granada, Spain, 2011.

** Abstract:** We introduce a
Gaussian process model of functions which are additive. An additive function
is one which decomposes into a sum of low-dimensional functions, each
depending on only a subset of the input variables. Additive GPs generalize
both Generalized Additive Models, and the standard GP models which use
squared-exponential kernels. Hyperparameter learning in this model can be
seen as Bayesian Hierarchical Kernel Learning (HKL). We introduce an
expressive but tractable parameterization of the kernel function, which
allows efficient evaluation of all input interaction terms, whose number is
exponential in the input dimension. The additional structure discoverable by
this model results in increased interpretability, as well as state-of-the-art
predictive power in regression tasks.

Gintare Karolina Dziugaite, Daniel M. Roy, and Zoubin Ghahramani.
**Training
generative neural networks via Maximum Mean Discrepancy
optimization**.
In *31st Conference on Uncertainty in Artificial Intelligence*, pages
258-267, Amsterdam, The Netherlands, July 2015.

** Abstract:** We consider training a deep neural network to generate samples
from an unknown distribution given i.i.d. data. We frame learning as an
optimization minimizing a two-sample test statistic; informally speaking, a
good generator network produces samples that cause a two-sample test to fail
to reject the null hypothesis. As our two-sample test statistic, we use an
unbiased estimate of the maximum mean discrepancy, which is the centerpiece
of the nonparametric kernel two-sample test proposed by Gretton et al.
(2012). We compare to the adversarial nets framework introduced by Goodfellow
et al. (2014), in which learning is a two-player game between a generator
network and an adversarial discriminator network, both trained to outwit the
other. From this perspective, the MMD statistic plays the role of the
discriminator. In addition to empirical comparisons, we prove bounds on the
generalization error incurred by optimizing the empirical MMD.

Frederik Eaton and Zoubin Ghahramani.
**Model reductions
for inference: Generality of pairwise, binary, and planar factor
graphs**.
*Neural Computation*, 25(5):1213-1260, 2013.

** Abstract:** We offer a solution to the problem of efficiently translating
algorithms between different types of discrete statistical model. We
investigate the expressive power of three classes of model (those with binary
variables, with pairwise factors, and with planar topology) as well as their
four intersections. We formalize a notion of "simple reduction" for the
problem of inferring marginal probabilities and consider whether it is
possible to "simply reduce" marginal inference from general discrete factor
graphs to factor graphs in each of these seven subclasses. We characterize
the reducibility of each class, showing in particular that the class of
binary pairwise factor graphs is able to simply reduce only positive models.
We also exhibit a continuous "spectral reduction" based on polynomial
interpolation, which overcomes this limitation. Experiments assess the
performance of standard approximate inference algorithms on the outputs of
our reductions.

Frederik Eaton and Zoubin Ghahramani.
**Choosing a variable
to clamp: Approximate inference using conditioned belief
propagation**.
In D. van Dyk and M. Welling, editors, *12th International Conference on
Artificial Intelligence and Statistics*, volume 5, pages 145-152,
Clearwater Beach, FL, USA, April 2009. Journal of Machine Learning
Research.

** Abstract:** In this paper we propose an algorithm
for approximate inference on graphical models based on belief propagation
(BP). Our algorithm is an approximate version of Cutset Conditioning, in
which a subset of variables is instantiated to make the rest of the graph
singly connected. We relax the constraint of single-connectedness, and select
variables one at a time for conditioning, running belief propagation after
each selection. We consider the problem of determining the best variable to
clamp at each level of recursion, and propose a fast heuristic which applies
back-propagation to the BP updates. We demonstrate that the heuristic
performs better than selecting variables at random, and give experimental
results which show that it performs competitively with existing approximate
inference algorithms.

** Comment:** Code (in C++
based on libDAI).

Gergely Flamich, Stratis Markou, and José Miguel Hernández-Lobato.
**Fast relative entropy coding with
A* coding**.
In *39th International Conference on Machine Learning*, 2022.

** Abstract:** Relative entropy coding (REC) algorithms encode a
sample from a target distribution $Q$ using a proposal distribution $P$, such
that the expected codelength is $\mathcal{O}(D_{KL}[Q||P])$. REC can be
seamlessly integrated with existing learned compression models since, unlike
entropy coding, it does not assume discrete $Q$ or $P$, and does not require
quantisation. However, general REC algorithms require an intractable
$\Omega(e^{D_{KL}[Q||P]})$ runtime. We introduce AS* and AD* coding, two REC
algorithms based on A* sampling. We prove that, for continuous distributions
over $\mathbb{R}$, if the density ratio is unimodal, AS* has
$\mathcal{O}(D_{\infty}[Q||P])$ expected runtime, where
$D_{\infty}[Q||P]$ is the Rényi $\infty$-divergence. We provide
experimental evidence that AD* also has $\mathcal{O}(D_{\infty}[Q||P])$
expected runtime. We prove that AS* and AD* achieve an expected codelength of
$\mathcal{O}(D_{KL}[Q||P])$. Further, we introduce DAD*, an approximate
algorithm based on AD* which retains its favourable runtime and has bias
similar to that of alternative methods. Focusing on VAEs, we propose the
IsoKL VAE (IKVAE), which can be used with DAD* to further improve compression
efficiency. We evaluate A* coding with (IK)VAEs on MNIST, showing that it can
losslessly compress images near the theoretically optimal limit.

Wessel P. Bruinsma, James Requeima, Andrew Y. K. Foong, Jonathan Gordon, and
Richard E. Turner.
**The Gaussian neural
process**.
In *3rd Symposium on Advances in Approximate Bayesian Inference*,
2021.

** Abstract:** Neural Processes (NPs; Garnelo et al.,
2018a,b) are a rich class of models for meta-learning that map data sets
directly to predictive stochastic processes. We provide a rigorous analysis
of the standard maximum-likelihood objective used to train conditional NPs.
Moreover, we propose a new member to the Neural Process family called the
Gaussian Neural Process (GNP), which models predictive correlations,
incorporates translation equivariance, provides universal approximation
guarantees, and demonstrates encouraging performance.

Andrew Y. K. Foong, Wessel P. Bruinsma, David R. Burt, and Richard E. Turner.
**How
tight can PAC-Bayes be in the small data regime?**.
In *Advances in Neural Information Processing Systems 34*. Curran
Associates, Inc., 2021.

** Abstract:** In this paper, we
investigate the question: Given a small number of datapoints, for example N =
30, how tight can PAC-Bayes and test set bounds be made? For such small
datasets, test set bounds adversely affect generalisation performance by
withholding data from the training procedure. In this setting, PAC-Bayes
bounds are especially attractive, due to their ability to use all the data to
simultaneously learn a posterior and bound its generalisation risk. We focus
on the case of i.i.d. data with a bounded loss and consider the generic
PAC-Bayes theorem of Germain et al. While their theorem is known to recover
many existing PAC-Bayes bounds, it is unclear what the tightest bound
derivable from their framework is. For a fixed learning algorithm and
dataset, we show that the tightest possible bound coincides with a bound
considered by Catoni; and, in the more natural case of distributions over
datasets, we establish a lower bound on the best bound achievable in
expectation. Interestingly, this lower bound recovers the Chernoff test set
bound if the posterior is equal to the prior. Moreover, to illustrate how
tight these bounds can be, we study synthetic one-dimensional classification
tasks in which it is feasible to meta-learn both the prior and the form of
the bound to numerically optimise for the tightest bounds possible. We find
that in this simple, controlled scenario, PAC-Bayes bounds are competitive
with comparable, commonly used Chernoff test set bounds. However, the
sharpest test set bounds still lead to better guarantees on the
generalisation error than the PAC-Bayes bounds we consider.
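
For readers unfamiliar with the Chernoff test set bound mentioned in this abstract, here is a minimal sketch of its standard relative-entropy form: with probability at least $1 - \delta$ over $n$ held-out i.i.d. points, $\mathrm{kl}(\hat{e} \,\|\, e) \le \log(1/\delta)/n$, where $\hat{e}$ is the empirical error and $e$ the true error. The function names, the bisection solver, and the example values are assumptions for illustration; the paper's exact formulation may differ.

```python
import math


def binary_kl(q, p):
    """kl(q || p) between Bernoulli(q) and Bernoulli(p), in nats."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)  # keep the logarithms finite
    total = 0.0
    if q > 0.0:
        total += q * math.log(q / p)
    if q < 1.0:
        total += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return total


def chernoff_test_set_bound(emp_error, n, delta):
    """Largest p >= emp_error with kl(emp_error || p) <= log(1/delta) / n."""
    budget = math.log(1.0 / delta) / n
    lo, hi = emp_error, 1.0
    for _ in range(100):  # bisection; kl(q || p) is increasing in p for p >= q
        mid = 0.5 * (lo + hi)
        if binary_kl(emp_error, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo


# Illustrative values only: 30 held-out points, 95% confidence.
for errors in (0, 3, 6):
    bound = chernoff_test_set_bound(errors / 30, n=30, delta=0.05)
    print(f"{errors}/30 test errors -> true error <= {bound:.3f}")
```

When the empirical error is zero the bound has the closed form $1 - \delta^{1/n}$, which gives a quick sanity check on the solver.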

Jonathan Gordon, Wessel Bruinsma, Andrew Y. K. Foong, James Requeima, Yann
Dubois, and Richard Turner.
**Convolutional
conditional neural processes**.
In *8th International Conference on Learning Representations*, Addis
Ababa, April 2020.

** Abstract:** We introduce the
Convolutional Conditional Neural Process (ConvCNP), a new member of the
Neural Process family that models translation equivariance in the data.
Translation equivariance is an important inductive bias for many learning
problems including time series modelling, spatial data, and images. The model
embeds data sets into an infinite-dimensional function space, as opposed to
finite-dimensional vector spaces. To formalize this notion, we extend the
theory of neural representations of sets to include functional
representations, and demonstrate that any translation-equivariant embedding
can be represented using a convolutional deep-set. We evaluate ConvCNPs in
several settings, demonstrating that they achieve state-of-the-art
performance compared to existing NPs. We demonstrate that building in
translation equivariance enables zero-shot generalization to challenging,
out-of-domain tasks.

Andrew Y. K. Foong, Wessel P. Bruinsma, Jonathan Gordon, Yann Dubois, James
Requeima, and Richard E. Turner.
**Meta-learning
stationary stochastic process prediction with convolutional neural
processes**.
In *Advances in Neural Information Processing Systems 33*. Curran
Associates, Inc., 2020.

** Abstract:** Stationary stochastic
processes (SPs) are a key component of many probabilistic models, such as
those for off-the-grid spatio-temporal data. They enable the statistical
symmetry of underlying physical phenomena to be leveraged, thereby aiding
generalization. Prediction in such models can be viewed as a translation
equivariant map from observed data sets to predictive SPs, emphasizing the
intimate relationship between stationarity and equivariance. Building on
this, we propose the Convolutional Neural Process (ConvNP), which endows
Neural Processes (NPs) with translation equivariance and extends
convolutional conditional NPs to allow for dependencies in the predictive
distribution. The latter enables ConvNPs to be deployed in settings which
require coherent samples, such as Thompson sampling or conditional image
completion. Moreover, we propose a new maximum-likelihood objective to
replace the standard ELBO objective in NPs, which conceptually simplifies the
framework and empirically improves performance. We demonstrate the strong
performance and generalization capabilities of ConvNPs on 1D regression,
image completion, and various tasks with real-world spatio-temporal data.

Andrew Foong, David Burt, Yingzhen Li, and Richard Turner.
**On the expressiveness of approximate inference in Bayesian neural
networks**.
In *Advances in Neural Information Processing Systems 33*, 2020.

James Urquhart Allingham, Florian Wenzel, Zelda E Mariet, Basil Mustafa, Joan
Puigcerver, Neil Houlsby, Ghassen Jerfel, Vincent Fortuin, Balaji
Lakshminarayanan, Jasper Snoek, Dustin Tran, Carlos Riquelme Ruiz, and
Rodolphe Jenatton.
**Sparse MoEs meet
efficient ensembles**.
*Transactions on Machine Learning Research*, 2022.

** Abstract:** Machine learning models based on the aggregated outputs of
submodels, either at the activation or prediction levels, often exhibit
strong performance compared to individual models. We study the interplay of
two popular classes of such models: ensembles of neural networks and sparse
mixture of experts (sparse MoEs). First, we show that the two approaches have
complementary features whose combination is beneficial. This includes a
comprehensive evaluation of sparse MoEs in uncertainty related benchmarks.
Then, we present efficient ensemble of experts ($E^3$), a scalable
and simple ensemble of sparse MoEs that takes the best of both classes of
models, while using up to 45% fewer FLOPs than a deep ensemble. Extensive
experiments demonstrate the accuracy, log-likelihood, few-shot learning,
robustness, and uncertainty improvements of $E^3$ over several
challenging vision Transformer-based baselines. $E^3$ not only
preserves its efficiency while scaling to models with up to 2.7B parameters,
but also provides better predictive performance and uncertainty estimates for
larger models.

** Comment:** Code

Vincent Fortuin, Mark Collier, Florian Wenzel, James Urquhart Allingham,
Jeremiah Zhe Liu, Dustin Tran, Balaji Lakshminarayanan, Jesse Berent,
Rodolphe Jenatton, and Effrosyni Kokiopoulou.
**Deep classifiers with
label noise modeling and distance awareness**.
*Transactions on Machine Learning Research*, 2022.

** Abstract:** Uncertainty estimation in deep learning has recently emerged as
a crucial area of interest to advance reliability and robustness in
safety-critical applications. While there have been many proposed methods
that either focus on distance-aware model uncertainties for
out-of-distribution detection or on input-dependent label uncertainties for
in-distribution calibration, both of these types of uncertainty are often
necessary. In this work, we propose the HetSNGP method for jointly modeling
the model and data uncertainty. We show that our proposed model affords a
favorable combination between these two types of uncertainty and thus
outperforms the baseline methods on some challenging out-of-distribution
datasets, including CIFAR-100C, ImageNet-C, and ImageNet-A. Moreover, we
propose HetSNGP Ensemble, an ensembled version of our method which
additionally models uncertainty over the network parameters and outperforms
other ensemble baselines.

** Comment:** Code

Vincent Fortuin, Adrià Garriga-Alonso, Sebastian W. Ober, Florian Wenzel,
Gunnar Rätsch, Richard E. Turner, Mark van der Wilk, and Laurence
Aitchison.
**Bayesian neural network
priors revisited**.
In *10th International Conference on Learning Representations*, 2022.

** Abstract:** Isotropic Gaussian priors are the de facto
standard for modern Bayesian neural network inference. However, it is unclear
whether these priors accurately reflect our true beliefs about the weight
distributions or give optimal performance. To find better priors, we study
summary statistics of neural network weights in networks trained using
stochastic gradient descent (SGD). We find that convolutional neural network
(CNN) and ResNet weights display strong spatial correlations, while fully
connected networks (FCNNs) display heavy-tailed weight distributions. We show
that building these observations into priors can lead to improved performance
on a variety of image classification datasets. Surprisingly, these priors
mitigate the cold posterior effect in FCNNs, but slightly increase the cold
posterior effect in ResNets.

Metod Jazbec, Matt Ashman, Vincent Fortuin, Michael Pearce, Stephan Mandt, and
Gunnar Rätsch.
**Scalable
Gaussian process variational autoencoders**.
In Arindam Banerjee and Kenji Fukumizu, editors, *Proceedings of The 24th
International Conference on Artificial Intelligence and Statistics*,
volume 130 of *Proceedings of Machine Learning Research*, pages
3511-3519. Proceedings of Machine Learning Research, 13-15 Apr 2021.

** Abstract:** Conventional variational autoencoders fail in
modeling correlations between data points due to their use of factorized
priors. Amortized Gaussian process inference through GP-VAEs has led to
significant improvements in this regard, but is still inhibited by the
intrinsic complexity of exact GP inference. We improve the scalability of
these methods through principled sparse inference approaches. We propose a
new scalable GP-VAE model that outperforms existing approaches in terms of
runtime and memory footprint, is easy to implement, and allows for joint
end-to-end optimization of all components.

Matthew Ashman, Jonny So, Will Tebbutt, Vincent Fortuin, Michael Pearce, and
Richard E. Turner.
**Sparse Gaussian process
variational autoencoders**.
2020.

** Abstract:** Large, multi-dimensional spatio-temporal
datasets are omnipresent in modern science and engineering. An effective
framework for handling such data are Gaussian process deep generative models
(GP-DGMs), which employ GP priors over the latent variables of DGMs. Existing
approaches for performing inference in GP-DGMs do not support sparse GP
approximations based on inducing points, which are essential for the
computational efficiency of GPs, nor do they handle missing data - a natural
occurrence in many spatio-temporal datasets - in a principled manner. We
address these shortcomings with the development of the sparse Gaussian
process variational autoencoder (SGP-VAE), characterised by the use of
partial inference networks for parameterising sparse GP approximations.
Leveraging the benefits of amortised variational inference, the SGP-VAE
enables inference in multi-output sparse GPs on previously unobserved data
with no additional training. The SGP-VAE is evaluated in a variety of
experiments where it outperforms alternative approaches including
multi-output GPs and structured VAEs.

Alexandre Khae Wu Navarro, Jes Frellsen, and Richard E. Turner.
**The Multivariate Generalised
von Mises distribution: Inference and applications**.
In *31st AAAI Conference on Artificial Intelligence*, San Francisco, CA,
USA, January 2017. AAAI Press.

** Abstract:** Circular
variables arise in a multitude of data-modelling contexts ranging from
robotics to the social sciences, but they have been largely overlooked by the
machine learning community. This paper partially redresses this imbalance by
extending some standard probabilistic modelling tools to the circular domain.
First we introduce a new multivariate distribution over circular variables,
called the multivariate Generalised von Mises (mGvM) distribution. This
distribution can be constructed by restricting and renormalising a general
multivariate Gaussian distribution to the unit hyper-torus. Previously
proposed multivariate circular distributions are shown to be special cases of
this construction. Second, we introduce a new probabilistic model for
circular regression, that is inspired by Gaussian Processes, and a method for
probabilistic principal component analysis with circular hidden variables.
These models can leverage standard modelling tools (e.g. covariance functions
and methods for automatic relevance determination). Third, we show that the
posterior distribution in these models is a mGvM distribution which enables
development of an efficient variational free-energy scheme for performing
approximate inference and approximate maximum-likelihood learning.

Jes Frellsen, Ole Winther, Zoubin Ghahramani, and Jesper Ferkinghoff-Borg.
**Bayesian generalised ensemble Markov chain Monte Carlo**.
In *19th International Conference on Artificial Intelligence and
Statistics*, Cadiz, Spain, May 2016.

** Abstract:**
Bayesian generalised ensemble (BayesGE) is a new method that addresses two
major drawbacks of standard Markov chain Monte Carlo algorithms for inference
in high-dimensional probability models: inapplicability to estimate the
partition function, and poor mixing properties. BayesGE uses a Bayesian
approach to iteratively update the belief about the density of states
(distribution of the log likelihood under the prior) for the model, with the
dual purpose of enhancing the sampling efficiency and making the estimation of
the partition function tractable. We benchmark BayesGE on Ising and Potts
systems and show that it compares favourably to existing state-of-the-art
methods.

Wouter Boomsma, Pengfei Tian, Jes Frellsen, Jesper Ferkinghoff-Borg, Thomas
Hamelryck, Kresten Lindorff-Larsen, and Michele Vendruscolo.
**Equilibrium simulations of proteins using molecular fragment replacement and
NMR chemical shifts**.
*Proceedings of the National Academy of Sciences*, 111(38):13852-13857,
2014, doi: 10.1073/pnas.1404948111.

** Abstract:** Methods of protein
structure determination based on NMR chemical shifts are becoming
increasingly common. The most widely used approaches adopt the molecular
fragment replacement strategy, in which structural fragments are repeatedly
reassembled into different complete conformations in molecular simulations.
Although these approaches are effective in generating individual structures
consistent with the chemical shift data, they do not enable the sampling of
the conformational space of proteins with correct statistical weights. Here,
we present a method of molecular fragment replacement that makes it possible
to perform equilibrium simulations of proteins, and hence to determine their
free energy landscapes. This strategy is based on the encoding of the
chemical shift information in a probabilistic model in Markov chain Monte
Carlo simulations. First, we demonstrate that with this approach it is
possible to fold proteins to their native states starting from extended
structures. Second, we show that the method satisfies the detailed balance
condition and hence it can be used to carry out an equilibrium sampling from
the Boltzmann distribution corresponding to the force field used in the
simulations. Third, by comparing the results of simulations carried out with
and without chemical shift restraints we describe quantitatively the effects
that these restraints have on the free energy landscapes of proteins. Taken
together, these results demonstrate that the molecular fragment replacement
strategy can be used in combination with chemical shift information to
characterize not only the native structures of proteins but also their
conformational fluctuations.

Jes Frellsen, Thomas Hamelryck, and Jesper Ferkinghoff-Borg.
**Combining the
multicanonical ensemble with generative probabilistic models of local
biomolecular structure**.
In *Proceedings of the 59th World Statistics Congress of the
International Statistical Institute*, pages 139-144, Hong Kong,
2014.

** Abstract:** Markov chain Monte Carlo is a powerful
tool for sampling complex systems such as large biomolecular structures.
However, the standard Metropolis-Hastings algorithm suffers from a number of
deficiencies when applied to systems with rugged free-energy landscapes. Some
of these deficiencies can be addressed with the multicanonical ensemble. In
this paper we will present two strategies for applying the multicanonical
ensemble to distributions constructed from generative probabilistic models of
local biomolecular structure. In particular, we will describe how to use the
multicanonical ensemble efficiently in conjunction with the reference ratio
method.

Peter Kerpedjiev, Jes Frellsen, Stinus Lindgreen, and Anders Krogh.
**Adaptable probabilistic mapping of short reads using position specific
scoring matrices**.
*BMC Bioinformatics*, 15:100, 2014, doi: 10.1186/1471-2105-15-100.

** Abstract:** BACKGROUND:
Modern DNA sequencing methods produce vast amounts of data that often
requires mapping to a reference genome. Most existing programs use the number
of mismatches between the read and the genome as a measure of quality. This
approach is without a statistical foundation and can for some data types
result in many wrongly mapped reads. Here we present a probabilistic mapping
method based on position-specific scoring matrices, which can take into
account not only the quality scores of the reads but also user-specified
models of evolution and data-specific biases. RESULTS: We show how evolution,
data-specific biases, and sequencing errors are naturally dealt with
probabilistically. Our method achieves better results than Bowtie and BWA on
simulated and real ancient and PAR-CLIP reads, as well as on simulated reads
from the AT rich organism P. falciparum, when modeling the biases of these
data. For simulated Illumina reads, the method has consistently higher
sensitivity for both single-end and paired-end data. We also show that our
probabilistic approach can limit the problem of random matches from short
reads of contamination and that it improves the mapping of real reads from
one organism (D. melanogaster) to a related genome (D. simulans). CONCLUSION:
The presented work is an implementation of a novel approach to short read
mapping where quality scores, prior mismatch probabilities and mapping
qualities are handled in a statistically sound manner. The resulting
implementation provides not only a tool for biologists working with low
quality and/or biased sequencing data but also a demonstration of the
feasibility of using a probability based alignment method on real and
simulated data sets.

** Comment:** Peter Kerpedjiev and Jes Frellsen contributed
equally. Additional resources are available at bwa-pssm.binf.ku.dk

Peter Menzel, Jes Frellsen, Mireya Plass, Simon H. Rasmussen, and Anders
Krogh.
**On the accuracy of short read mapping**.
In *Deep Sequencing Data Analysis*, volume 1038, pages 39-59. Springer,
2013.

** Abstract:** The development of high-throughput
sequencing technologies has revolutionized the way we study genomes and gene
regulation. In a single experiment, millions of reads are produced. To gain
knowledge from these experiments the first thing to be done is finding the
genomic origin of the reads, i.e., mapping the reads to a reference genome.
In this new situation, conventional alignment tools are obsolete, as they
cannot handle this huge amount of data in a reasonable amount of time. Thus,
new mapping algorithms have been developed, which are fast at the expense of
a small decrease in accuracy. In this chapter we discuss the current problems
in short read mapping and show that mapping reads correctly is a nontrivial
task. Through simple experiments with both real and synthetic data, we
demonstrate that different mappers can give different results depending on
the type of data, and that a considerable fraction of uniquely mapped reads
is potentially mapped to an incorrect location. Furthermore, we provide
simple statistical results on the expected number of random matches in a
genome (E-value) and the probability of a random match as a function of read
length. Finally, we show that quality scores contain valuable information for
mapping and why mapping quality should be evaluated in a probabilistic
manner. In the end, we discuss the potential of improving the performance of
current methods by considering these quality scores in a probabilistic
mapping program.

** Comment:** Peter Menzel and Jes Frellsen contributed
equally.
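
As a rough illustration of the E-value discussed in the abstract above, the sketch below computes the expected number of chance exact matches of a read in a genome, and the probability of at least one such match, as a function of read length. It assumes uniform, independent bases, exact matching only, and a Poisson approximation; these simplifications and the function names are assumptions for illustration and are not necessarily the formulas used in the chapter.

```python
import math


def expected_random_matches(read_length, genome_length, both_strands=True):
    """Expected number of exact chance matches of one read (an E-value).

    Assumes each base is uniform over {A, C, G, T} and independent.
    """
    positions = max(genome_length - read_length + 1, 0)
    strands = 2 if both_strands else 1
    return strands * positions * 0.25 ** read_length


def prob_random_match(read_length, genome_length, both_strands=True):
    """Poisson approximation to P(at least one chance match)."""
    e_value = expected_random_matches(read_length, genome_length, both_strands)
    return -math.expm1(-e_value)  # 1 - exp(-E), computed stably for small E


genome_length = 3_000_000_000  # illustrative, roughly human-genome scale
for read_length in (12, 16, 20, 25, 30):
    e_value = expected_random_matches(read_length, genome_length)
    p_match = prob_random_match(read_length, genome_length)
    print(f"L={read_length:2d}  E-value={e_value:.3e}  P(match)={p_match:.3e}")
```

Under these assumptions the E-value falls by a factor of four with every additional base, which is why very short reads are expected to match a large genome by chance.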

Roger Frigola.
**Bayesian time series
learning with Gaussian processes**.
PhD thesis, University of Cambridge, Department of Engineering, Cambridge, UK,
2015.

** Abstract:** The analysis of time series data is
important in fields as disparate as the social sciences, biology, engineering
or econometrics. In this dissertation, we present a number of algorithms
designed to learn Bayesian nonparametric models of time series. The goal of
these kinds of models is twofold. First, they aim at making predictions which
quantify the uncertainty due to limitations in the quantity and the quality
of the data. Second, they are flexible enough to model highly complex data
whilst preventing overfitting when the data does not warrant complex
models.

We begin with a unifying literature review on time series models
based on Gaussian processes. Then, we centre our attention on the Gaussian
Process State-Space Model (GP-SSM): a Bayesian nonparametric generalisation
of discrete-time nonlinear state-space models. We present a novel formulation
of the GP-SSM that offers new insights into its properties. We then proceed
to exploit those insights by developing new learning algorithms for the
GP-SSM based on particle Markov chain Monte Carlo and variational
inference.

Finally, we present a filtered nonlinear auto-regressive
model with a simple, robust and fast learning algorithm that makes it well
suited to its application by non-experts on large datasets. Its main
advantage is that it avoids the computationally expensive (and potentially
difficult to tune) smoothing step that is a key part of learning nonlinear
state-space models.

Roger Frigola, Yutian Chen, and Carl Edward Rasmussen.
**Variational
Gaussian process state-space models**.
In Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and K.Q. Weinberger,
editors, *Advances in Neural Information Processing Systems 27*,
2014.

** Abstract:** State-space models have been successfully
used for more than fifty years in different areas of science and engineering.
We present a procedure for efficient variational Bayesian learning of
nonlinear state-space models based on sparse Gaussian processes. The result
of learning is a tractable posterior over nonlinear dynamical systems. In
comparison to conventional parametric models, we offer the possibility to
straightforwardly trade off model capacity and computational cost whilst
avoiding overfitting. Our main algorithm uses a hybrid inference approach
combining variational Bayes and sequential Monte Carlo. We also present
stochastic variational inference and online learning approaches for fast
learning with long time series.

Roger Frigola, Fredrik Lindsten, Thomas B. Schön, and Carl Edward
Rasmussen.
**Identification of Gaussian
process state-space models with particle stochastic approximation
EM**.
In *Proceedings of the 19th World Congress of the International Federation
of Automatic Control (IFAC)*, 2014.

** Abstract:**
Gaussian process state-space models (GP-SSMs) are a very flexible family of
models of nonlinear dynamical systems. They comprise a Bayesian nonparametric
representation of the dynamics of the system and additional
(hyper-)parameters governing the properties of this nonparametric
representation. The Bayesian formalism enables systematic reasoning about the
uncertainty in the system dynamics. We present an approach to maximum
likelihood identification of the parameters in GP-SSMs, while retaining the
full nonparametric description of the dynamics. The method is based on a
stochastic approximation version of the EM algorithm that employs recent
developments in particle Markov chain Monte Carlo for efficient
identification.

Roger Frigola, Fredrik Lindsten, Thomas B. Schön, and Carl Edward
Rasmussen.
**Bayesian inference and learning in
Gaussian process state-space models with particle MCMC**.
In L. Bottou, C.J.C. Burges, Z. Ghahramani, M. Welling, and K.Q. Weinberger,
editors, *Advances in Neural Information Processing Systems 26*.
Curran Associates, Inc., 2013.

** Abstract:** State-space
models are successfully used in many areas of science, engineering and
economics to model time series and dynamical systems. We present a fully
Bayesian approach to inference and learning in nonlinear nonparametric
state-space models. We place a Gaussian process prior over the transition
dynamics, resulting in a flexible model able to capture complex dynamical
phenomena. However, to enable efficient inference, we marginalize over the
dynamics of the model and instead infer directly the joint smoothing
distribution through the use of specially tailored Particle Markov Chain
Monte Carlo samplers. Once a sample from the smoothing distribution is
computed, the state transition predictive distribution can be formulated
analytically. We make use of sparse Gaussian process models to greatly reduce
the computational complexity of the approach.

Roger Frigola and Carl Edward Rasmussen.
**Integrated pre-processing for
Bayesian nonlinear system identification with Gaussian processes**.
In *Decision and Control (CDC), 2013 IEEE 52nd Annual Conference on*,
2013.

** Abstract:** We introduce GP-FNARX: a new model for
nonlinear system identification based on a nonlinear autoregressive exogenous
model (NARX) with filtered regressors (F) where the nonlinear regression
problem is tackled using sparse Gaussian processes (GP). We integrate data
pre-processing with system identification into a fully automated procedure
that goes from raw data to an identified model. Both pre-processing
parameters and GP hyper-parameters are tuned by maximizing the marginal
likelihood of the probabilistic model. We obtain a Bayesian model of the
system's dynamics which is able to report its uncertainty in regions where
the data is scarce. The automated approach, the modeling of uncertainty and
its relatively low computational cost make GP-FNARX a good candidate for
applications in robotics and adaptive control.

David A. Knowles, Jurgen Van Gael, and Zoubin Ghahramani.
**Message passing
algorithms for the Dirichlet diffusion tree**.
In *28th International Conference on Machine Learning*, 2011.

** Abstract:** We demonstrate efficient approximate inference
for the Dirichlet Diffusion Tree (Neal, 2003), a Bayesian nonparametric prior
over tree structures. Although DDTs provide a powerful and elegant approach
for modeling hierarchies, they haven't seen much use to date. One problem is
the computational cost of MCMC inference. We provide the first deterministic
approximate inference methods for DDT models and show excellent performance
compared to the MCMC alternative. We present message passing algorithms to
approximate the Bayesian model evidence for a specific tree. This is used to
drive sequential tree building and greedy search to find optimal tree
structures, corresponding to hierarchical clusterings of the data. We
demonstrate appropriate observation models for continuous and binary data.
The empirical performance of our method is very close to the computationally
expensive MCMC alternative on a density estimation problem, and significantly
outperforms kernel density estimators.

** Comment:** web site

G. Kasneci, J. Van Gael, T. Graepel, and R. Herbrich.
**Bayesian
knowledge corroboration with logical rules and user feedback**.
In *European Conference on Machine Learning (ECML)*, Barcelona, Spain,
September 2010.

** Abstract:** Current knowledge bases suffer
from either low coverage or low accuracy. The underlying hypothesis of this
work is that user feedback can greatly improve the quality of automatically
extracted knowledge bases. The feedback could help quantify the uncertainty
associated with the stored statements and would enable mechanisms for
searching, ranking and reasoning at entity-relationship level. Most
importantly, a principled model for exploiting user feedback to learn the
truth values of statements in the knowledge base would be a major step
forward in addressing the issue of knowledge base curation. We present a
family of probabilistic graphical models that builds on user feedback and
logical inference rules derived from the popular Semantic-Web formalism of
RDFS [1]. Through internal inference and belief propagation, these models can
learn both the truth
values of the statements in the knowledge base and the reliabilities of the
users who give feedback. We demonstrate the viability of our approach in
extensive experiments on real-world datasets, with feedback collected from
Amazon Mechanical Turk.

C. Rotsos, J. Van Gael, A.W. Moore, and Z. Ghahramani.
**Traffic
classification in information poor environments**.
In *1st International Workshop on Traffic Analysis and Classification (IWCMC
'10)*, Caen, France, July 2010.

** Abstract:** Traffic
classification using machine learning continues to be an active research
area. The majority of work in this area uses *off-the-shelf* machine
learning tools and treats them as *black-box* classifiers. This approach
turns all the modelling complexity into a feature selection problem. In this
paper, we build a problem-specific solution to the traffic classification
problem by designing a custom probabilistic graphical model. Graphical models
are a modular framework to design classifiers which incorporate
domain-specific knowledge. More specifically, our solution introduces
semi-supervised learning which means we learn from both labelled and
unlabelled traffic flows. We show that our solution performs competitively
compared to previous approaches while using less data and simpler
features.

Sébastien Bratières, Jurgen Van Gael, Andreas Vlachos, and Zoubin
Ghahramani.
**Scaling the
iHMM: Parallelization versus Hadoop**.
In *Proceedings of the 2010 10th IEEE International Conference on Computer
and Information Technology*, pages 1235-1240, Bradford, UK, 2010. IEEE
Computer Society, doi: 10.1109/CIT.2010.223.

** Abstract:** This paper compares
parallel and distributed implementations of an iterative, Gibbs sampling,
machine learning algorithm. Distributed implementations run under Hadoop on
facility computing clouds. The probabilistic model under study is the
infinite HMM (Beal, Ghahramani and Rasmussen,
2002), in which parameters are learnt using instance-blocked Gibbs
sampling, with a step consisting of a dynamic program. We apply this model to
learn part-of-speech tags from newswire text in an unsupervised fashion.
However, our focus here is on runtime performance, as opposed to NLP-relevant
scores, embodied by iteration duration, ease of development, deployment and
debugging.

Charalampos Rotsos, Jurgen Van Gael, Andrew W. Moore, and Zoubin Ghahramani.
**Probabilistic
graphical models for semi-supervised traffic classification**.
In *The 6th International Wireless Communications and Mobile Computing
Conference*, pages 752-757, Caen, France, 2010.

** Abstract:** Traffic classification using machine learning continues to be
an active research area. The majority of work in this area uses off-the-shelf
machine learning tools and treats them as black-box classifiers. This
approach turns all the modelling complexity into a feature selection problem.
In this paper, we build a problem-specific solution to the traffic
classification problem by designing a custom probabilistic graphical model.
Graphical models are a modular framework to design classifiers which
incorporate domain-specific knowledge. More specifically, our solution
introduces semi-supervised learning which means we learn from both labelled
and unlabelled traffic flows. We show that our solution performs
competitively compared to previous approaches while using less data and
simpler features.

J. Van Gael, A. Vlachos, and Z. Ghahramani.
**The infinite
HMM for unsupervised PoS tagging**.
In *Proceedings of the 2009 Conference on Empirical Methods in Natural
Language Processing (EMNLP)*, pages 678-687, Singapore, August 2009.
Association for Computational Linguistics.

** Abstract:** We
extend previous work on fully unsupervised part-of-speech tagging. Using a
non-parametric version of the HMM, called the infinite HMM (iHMM), we address
the problem of choosing the number of hidden states in unsupervised Markov
models for PoS tagging. We experiment with two non-parametric priors, the
Dirichlet and Pitman-Yor processes, on the Wall Street Journal dataset using
a parallelized implementation of an iHMM inference algorithm. We evaluate the
results with a variety of clustering evaluation metrics and achieve
equivalent or better performances than previously reported. Building on this
promising result we evaluate the output of the unsupervised PoS tagger as a
direct replacement for the output of a fully supervised PoS tagger for the
task of shallow parsing and compare the two evaluations.

F. Doshi-Velez, K.T. Miller, J. Van Gael, and Y.W. Teh.
**Variational
inference for the Indian buffet process**.
In *12th International Conference on Artificial Intelligence and
Statistics*, volume 12, pages 137-144, Clearwater Beach, FL, USA,
April 2009. Journal of Machine Learning Research.

**
Abstract:** The Indian Buffet Process (IBP) is a nonparametric prior for
latent feature models in which observations are influenced by a combination
of hidden features. For example, images may be composed of several objects
and sounds may consist of several notes. Latent feature models seek to infer
these unobserved features from a set of observations; the IBP provides a
principled prior in situations where the number of hidden features is
unknown. Current inference methods for the IBP have all relied on sampling.
While these methods are guaranteed to be accurate in the limit, samplers for
the IBP tend to mix slowly in practice. We develop a deterministic
variational method for inference in the IBP based on a truncated
stick-breaking approximation, provide theoretical bounds on the truncation
error, and evaluate our method in several data regimes.

Finale Doshi-Velez, Kurt T. Miller, Jurgen Van Gael, and Yee Whye Teh.
**Variational
inference for the Indian buffet process**.
Technical Report CBL-2009-001, University of Cambridge, Computational and
Biological Learning Laboratory, Department of Engineering, April 2009.

** Abstract:** The Indian Buffet Process (IBP) is a
nonparametric prior for latent feature models in which observations are
influenced by a combination of hidden features. For example, images may be
composed of several objects and sounds may consist of several notes. Latent
feature models seek to infer these unobserved features from a set of
observations; the IBP provides a principled prior in situations where the
number of hidden features is unknown. Current inference methods for the IBP
have all relied on sampling. While these methods are guaranteed to be
accurate in the limit, samplers for the IBP tend to mix slowly in practice.
We develop a deterministic variational method for inference in the IBP based
on truncating to finite models, provide theoretical bounds on the
truncation error, and evaluate our method in several data regimes. This
technical report is a longer version of Doshi-Velez et al. (2009).

J. Van Gael, Y.W. Teh, and Z. Ghahramani.
**The infinite
factorial hidden Markov model**.
In D. Koller, D. Schuurmans, L. Bottou, and Y. Bengio, editors, *Advances in
Neural Information Processing Systems 21*, volume 21, pages
1697-1704, Cambridge, MA, USA, December 2008. The MIT Press.

** Abstract:** The infinite factorial hidden Markov model is a
non-parametric extension of the factorial hidden Markov model. Our model
defines a probability distribution over an infinite number of independent
binary hidden Markov chains which together produce an observable sequence of
random variables. Central to our model is a new type of non-parametric prior
distribution inspired by the Indian Buffet Process which we call the
*Indian Buffet Markov Process*.

Jurgen Van Gael, Yunus Saatçi, Yee-Whye Teh, and Zoubin Ghahramani.
**Beam sampling
for the infinite hidden Markov model**.
In *25th International Conference on Machine Learning*, volume 25,
pages 1088-1095, Helsinki, Finland, 2008. Association for Computing
Machinery.

** Abstract:** The infinite hidden Markov model is
a non-parametric extension of the widely used hidden Markov model. Our paper
introduces a new inference algorithm for the infinite hidden Markov model
called beam sampling. Beam sampling combines slice sampling, which limits the
number of states considered at each time step to a finite number, with
dynamic programming, which samples whole state trajectories efficiently. Our
algorithm typically outperforms the Gibbs sampler and is more robust. We
present applications of iHMM inference using the beam sampler on changepoint
detection and text prediction problems.

A.B. Goldberg, X. Zhu, J. Van Gael, and D. Andrzejewski.
**Improving
diversity in ranking using absorbing random walks**.
In *Proceedings of NAACL HLT*, pages 97-104, Rochester, NY, USA, April
2007.

J. Van Gael and X. Zhu.
**Correlation
clustering for crosslingual link detection**.
In Manuela M. Veloso, editor, *International Joint Conference on Artificial
Intelligence*, pages 1744-1749, Hyderabad, India, January 2007.

** Abstract:** The crosslingual link detection problem calls for
identifying news articles in multiple languages that report on the same news
event. This paper presents a novel approach based on constrained clustering.
We discuss a general way for constrained clustering using a recent,
graph-based clustering framework called correlation clustering. We introduce
a correlation clustering implementation that features linear program chunking
to allow processing larger datasets. We show how to apply the correlation
clustering algorithm to the crosslingual link detection problem and present
experimental results that show correlation clustering improves upon the
hierarchical clustering approaches commonly used in link detection, and upon
hierarchical clustering approaches that take constraints into account.

A. B. Goldberg, D. Andrzejewski, J. Van Gael, B. Settles, X. Zhu, and
M. Craven.
**Ranking
biomedical passages for relevance and diversity: University of Wisconsin,
Madison at TREC Genomics 2006**.
In *Proceedings of the Fifteenth Text REtrieval Conference (TREC 2006)*,
Gaithersburg, MD, USA, November 2006.

** Abstract:** We report
on the University of Wisconsin, Madison's experience in the TREC Genomics
2006 track, which asks participants to retrieve passages from scientific
articles that satisfy biologists' information needs. An emphasis is placed on
returning relevant passages that discuss different aspects of the topic.
Using an off-the-shelf information retrieval (IR) engine, we focused on query
generation and reranking query results to encourage relevance and diversity.
For query generation, we automatically identify noun phrases from the topic
descriptions, and use online resources to gather synonyms as expansion terms.
Our first submission uses the baseline IR engine results. We rerank the
passages using a naive clustering-based approach in our second run, and we
test GRASSHOPPER, a novel graph-theoretic algorithm based on absorbing random
walks, in our third run. While our aspect-level results appear to compare
favorably with other participants on average, our query generation techniques
failed to produce adequate query results for several topics, causing our
passage and document-level evaluation scores to suffer. Furthermore, we
surprisingly achieved higher aspect-level scores using the initial ranking
than our methods aimed specifically at promoting diversity. While this sounds
discouraging, we have several ideas as to why this happened and hope to
produce new methods that correct these shortcomings.

Jan M Brauner, Sören Mindermann, Mrinank Sharma, David Johnston, John
Salvatier, Tomáš Gavenčiak, Anna B Stephenson, Gavin Leech,
George Altman, Vladimir Mikulik, Alexander John Norman, Joshua Teperowski
Monrad, Tamay Besiroglu, Hong Ge, Meghan A Hartwick, Yee Whye Teh, Leonid
Chindelevitch, Yarin Gal, and Jan Kulveit.
**Inferring the
effectiveness of government interventions against COVID-19**.
*Science*, December 2020.

Yingzhen Li and Yarin Gal.
**Dropout Inference
in Bayesian Neural Networks with Alpha-divergences**.
In *34th International Conference on Machine Learning*, Sydney,
Australia, August 2017.

** Abstract:** To obtain uncertainty
estimates with real-world Bayesian deep learning models, practical inference
approximations are needed. Dropout variational inference (VI) for example has
been used for machine vision and medical applications, but VI can severely
underestimate model uncertainty. Alpha-divergences are alternative
divergences to VI’s KL objective, which are able to avoid VI’s
uncertainty underestimation. But these are hard to use in practice: existing
techniques can only use Gaussian approximating distributions, and require
existing models to be changed radically, thus are of limited use for
practitioners. We propose a re-parametrisation of the alpha-divergence
objectives, deriving a simple inference technique which, together with
dropout, can be easily implemented with existing models by simply changing
the loss of the model. We demonstrate improved uncertainty estimates and
accuracy compared to VI in dropout networks. We study our model’s epistemic
uncertainty far away from the data using adversarial images, showing that
these can be distinguished from non-adversarial images by examining our
model’s uncertainty.

Rowan McAllister, Yarin Gal, Alex Kendall, Mark van der Wilk, Amar Shah,
Roberto Cipolla, and Adrian Weller.
**Concrete problems
for autonomous vehicle safety: Advantages of Bayesian deep
learning**.
In *International Joint Conference on Artificial Intelligence*,
Melbourne, Australia, August 2017.

** Abstract:** Autonomous
vehicle (AV) software is typically composed of a pipeline of individual
components, linking sensor inputs to motor outputs. Erroneous component
outputs propagate downstream, hence safe AV software must consider the
ultimate effect of each component's errors. Further, improving safety alone
is not sufficient. Passengers must also feel safe to trust and use AV
systems. To address such concerns, we investigate three under-explored themes
for AV research: safety, interpretability, and compliance. Safety can be
improved by quantifying the uncertainties of component outputs and
propagating them forward through the pipeline. Interpretability is concerned
with explaining what the AV observes and why it makes the decisions it does,
building reassurance with the passenger. Compliance refers to maintaining
some control for the passenger. We discuss open challenges for research
within these themes. We highlight the need for concrete evaluation metrics,
propose example problems, and highlight possible solutions.

Yarin Gal, Jiri Hron, and Alex Kendall.
**Concrete dropout**.
*NeurIPS*, 2017.

** Abstract:** Dropout is used as a
practical tool to obtain uncertainty estimates in large vision models and
reinforcement learning (RL) tasks. But to obtain well-calibrated uncertainty
estimates, a grid-search over the dropout probabilities is necessary—a
prohibitive operation with large models, and an impossible one with RL. We
propose a new dropout variant which gives improved performance and better
calibrated uncertainties. Relying on recent developments in Bayesian deep
learning, we use a continuous relaxation of dropout's discrete masks.
Together with a principled optimisation objective, this allows for automatic
tuning of the dropout probability in large models, and as a result faster
experimentation cycles. In RL this allows the agent to adapt its uncertainty
dynamically as more data is observed. We analyse the proposed variant
extensively on a range of tasks, and give insights into common practice in
the field where larger dropout probabilities are often used in deeper model
layers.

Yarin Gal, Yutian Chen, and Zoubin Ghahramani.
**Latent
Gaussian processes for distribution estimation of multivariate categorical
data**.
In *Proceedings of the 32nd International Conference on Machine Learning
(ICML-15)*, pages 645-654, 2015.

** Abstract:**
Multivariate categorical data occur in many applications of machine learning.
One of the main difficulties with these vectors of categorical variables is
sparsity. The number of possible observations grows exponentially with vector
length, but dataset diversity might be poor in comparison. Recent models have
gained significant improvement in supervised tasks with this data. These
models embed observations in a continuous space to capture similarities
between them. Building on these ideas we propose a Bayesian model for the
unsupervised task of distribution estimation of multivariate categorical
data. We model vectors of categorical variables as generated from a
non-linear transformation of a continuous latent space. Non-linearity
captures multi-modality in the distribution. The continuous representation
addresses sparsity. Our model ties together many existing models, linking the
linear categorical latent Gaussian model, the Gaussian process latent
variable model, and Gaussian process classification. We derive inference for
our model based on recent developments in sampling based variational
inference. We show empirically that the model outperforms its linear and
discrete counterparts in imputation tasks of sparse data.

Yarin Gal and Richard Turner.
**Improving the
Gaussian process sparse spectrum approximation by representing uncertainty
in frequency inputs**.
In *Proceedings of the 32nd International Conference on Machine Learning
(ICML-15)*, pages 655-664, 2015.

** Abstract:** Standard
sparse pseudo-input approximations to the Gaussian process (GP) cannot handle
complex functions well. Sparse spectrum alternatives attempt to answer this
but are known to over-fit. We suggest the use of variational inference for
the sparse spectrum approximation to avoid both issues. We model the
covariance function with a finite Fourier series approximation and treat it
as a random variable. The random covariance function has a posterior, on
which a variational distribution is placed. The variational distribution
transforms the random covariance function to fit the data. We study the
properties of our approximate inference, compare it to alternative ones, and
extend it to the distributed and stochastic domains. Our approximation
captures complex functions better than standard approaches and avoids
over-fitting.

Yarin Gal and Zoubin Ghahramani.
**Pitfalls in the
use of parallel inference for the Dirichlet process**.
In *Proceedings of the 31st International Conference on Machine Learning
(ICML-14)*, 2014.

** Abstract:** Recent work done by
Lovell, Adams, and Mansinghka (2012) and Williamson, Dubey, and Xing (2013)
has suggested an alternative parametrisation for the Dirichlet process in
order to derive non-approximate parallel MCMC inference for it – work which
has been picked up and implemented in several different fields. In this paper
we show that the approach suggested is impractical due to an extremely
unbalanced distribution of the data. We characterise the requirements of
efficient parallel inference for the Dirichlet process and show that the
proposed inference fails most of these requirements (while approximate
approaches often satisfy most of them). We present both theoretical and
experimental evidence, analysing the load balance for the inference and
showing that it is independent of the size of the dataset and the number of
nodes available in the parallel implementation. We end with suggestions of
alternative paths of research for efficient non-approximate parallel
inference for the Dirichlet process.

Yarin Gal, Mark van der Wilk, and Carl Rasmussen.
**Distributed
variational inference in sparse Gaussian process regression and latent
variable models**.
In Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and K.Q. Weinberger,
editors, *Advances in Neural Information Processing Systems 27*, pages
3257-3265. Curran Associates, Inc., 2014.

** Abstract:**
Gaussian processes (GPs) are a powerful tool for probabilistic inference over
functions. They have been applied to both regression and non-linear
dimensionality reduction, and offer desirable properties such as uncertainty
estimates, robustness to over-fitting, and principled ways for tuning
hyper-parameters. However, the scalability of these models to big datasets
remains an active topic of research. We introduce a novel re-parametrisation
of variational inference for sparse GP regression and latent variable models
that allows for an efficient distributed algorithm. This is done by
exploiting the decoupling of the data given the inducing points to
re-formulate the evidence lower bound in a Map-Reduce setting. We show that
the inference scales well with data and computational resources, while
preserving a balanced distribution of the load among the nodes. We further
demonstrate the utility in scaling Gaussian processes to big data. We show
that GP performance improves with increasing amounts of data in regression
(on flight data with 2 million records) and latent variable modelling (on
MNIST). The results show that GPs perform better than many common models
often used for big data.

Yarin Gal and Phil Blunsom.
**A systematic
Bayesian treatment of the IBM alignment models**.
In *Proceedings of the 2013 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies*.
Association for Computational Linguistics, 2013.

**
Abstract:** The dominant yet ageing IBM and HMM word alignment models
underpin most popular Statistical Machine Translation implementations in use
today. Though beset by the limitations of implausible independence
assumptions, intractable optimisation problems, and an excess of tunable
parameters, these models provide a scalable and reliable starting point for
inducing translation systems. In this paper we build upon this venerable base
by recasting these models in the non-parametric Bayesian framework. By
replacing the categorical distributions at their core with hierarchical
Pitman-Yor processes, and through the use of collapsed Gibbs sampling, we
provide a more flexible formulation and sidestep the original heuristic
optimisation techniques. The resulting models are highly extendible,
naturally permitting the introduction of phrasal dependencies. We present
extensive experimental results showing improvements in both AER and BLEU when
benchmarked against Giza++, including significant improvements over IBM model
4.

Vincent Fortuin, Adrià Garriga-Alonso, Sebastian W. Ober, Florian Wenzel,
Gunnar Rätsch, Richard E. Turner, Mark van der Wilk, and Laurence
Aitchison.
**Bayesian neural network
priors revisited**.
In *10th International Conference on Learning Representations*, 2022.

** Abstract:** Isotropic Gaussian priors are the de facto
standard for modern Bayesian neural network inference. However, it is unclear
whether these priors accurately reflect our true beliefs about the weight
distributions or give optimal performance. To find better priors, we study
summary statistics of neural network weights in networks trained using
stochastic gradient descent (SGD). We find that convolutional neural network
(CNN) and ResNet weights display strong spatial correlations, while fully
connected networks (FCNNs) display heavy-tailed weight distributions. We show
that building these observations into priors can lead to improved performance
on a variety of image classification datasets. Surprisingly, these priors
mitigate the cold posterior effect in FCNNs, but slightly increase the cold
posterior effect in ResNets.

David R. Burt, Sebastian W. Ober, Adrià Garriga-Alonso, and Mark van der
Wilk.
**Understanding
variational inference in function-space**.
In *3rd Symposium on Advances in Approximate Bayesian Inference*,
2021.

** Abstract:** Recent work has attempted to directly
approximate the ‘function-space’ or predictive posterior distribution of
Bayesian models, without approximating the posterior distribution over the
parameters. This is appealing in e.g. Bayesian neural networks, where we only
need the former, and the latter is hard to represent. In this work, we
highlight some advantages and limitations of employing the Kullback-Leibler
divergence in this setting. For example, we show that minimizing the KL
divergence between a wide class of parametric distributions and the posterior
induced by a (non-degenerate) Gaussian process prior leads to an ill-defined
objective function. Then, we propose (featurized) Bayesian linear regression
as a benchmark for ‘function-space’ inference methods that directly
measures approximation quality. We apply this methodology to assess aspects
of the objective function and inference scheme considered in Sun et al.
(2018), emphasizing the quality of approximation to Bayesian inference as
opposed to predictive performance.

Adrià Garriga-Alonso, Carl Edward Rasmussen, and Laurence Aitchison.
**Deep convolutional networks as
shallow Gaussian processes**.
In *International Conference on Learning Representations (ICLR)*,
2019.

** Abstract:** We show that the output of a (residual)
convolutional neural network (CNN) with an appropriate prior over the weights
and biases is a Gaussian process (GP) in the limit of infinitely many
convolutional filters, extending similar results for dense networks. For a
CNN, the equivalent kernel can be computed exactly and, unlike "deep
kernels", has very few parameters: only the hyperparameters of the original
CNN. Further, we show that this kernel has two properties that allow it to be
computed efficiently; the cost of evaluating the kernel for a pair of images
is similar to a single forward pass through the original CNN with only one
filter per layer. The kernel equivalent to a 32-layer ResNet obtains 0.84%
classification error on MNIST, a new record for GPs with a comparable number
of parameters.

George Nicholson, Marta Blangiardo, Mark Briers, Peter J Diggle, Tor Erlend
Fjelde, Hong Ge, Robert J B Goudie, Radka Jersakova, Ruairidh E King, Brieuc
C L Lehmann, Ann-Marie Mallon, Tullia Padellini, Yee Whye Teh, Chris Holmes,
and Sylvia Richardson.
**Interoperability of
statistical models in pandemic preparedness: principles and reality**.
*Stat. Sci.*, 37(2):183-206, May 2022.

** Abstract:** We
present interoperability as a guiding framework for statistical modelling to
assist policy makers asking multiple questions using diverse datasets in the
face of an evolving pandemic response. Interoperability provides an important
set of principles for future pandemic preparedness, through the joint design
and deployment of adaptable systems of statistical models for disease
surveillance using probabilistic reasoning. We illustrate this through case
studies for inferring and characterising spatial-temporal prevalence and
reproduction numbers of SARS-CoV-2 infections in England.

Adrian Goldwaser and Hong Ge.
**Learning deep neural networks
through iterative linearisation**.
In *NeurIPS 2022 Workshop Optimisation in Machine Learning*, 2022.

** Abstract:** The excellent real-world performance of deep
neural networks has received increasing attention. Despite the capacity to
overfit significantly, such large models work better than smaller ones. This
phenomenon is often referred to as the scaling law by practitioners. It is of
fundamental interest to study why the scaling law exists and how it
avoids/controls overfitting. One approach has been looking at infinite width
limits of neural networks (e.g., Neural Tangent Kernels, Gaussian Processes);
however, in practice, these do not fully explain finite networks as their
infinite counterparts do not learn features. Furthermore, the empirical
kernel for finite networks (i.e., the inner product of feature vectors),
changes significantly during training in contrast to infinite width networks.
In this work we derive an iterative linearised training method. We justify
iterative linearisation as an interpolation between finite analogs of the
infinite width regime, which do not learn features, and standard gradient
descent training which does. We show some preliminary results where iterative
linearised training works well, noting in particular how much feature
learning is required to achieve comparable performance. We also provide novel
insights into the training behaviour of neural networks.

Shengchao Liu, Hanchen Wang, Weiyang Liu, Joan Lasenby, Hongyu Guo, and Jian
Tang.
**Pre-training molecular graph representation with 3d geometry**.
In *International Conference on Learning Representations*, 2022.

Alexander Terenin, David R. Burt, Artem Artemev, Seth Flaxman, Mark van der
Wilk, Carl Edward Rasmussen, and Hong Ge.
**Numerically stable sparse
Gaussian processes via minimum separation using cover trees**.
*arXiv*, 2022.

** Abstract:** As Gaussian processes
mature, they are increasingly being deployed as part of larger machine
learning and decision-making systems, for instance in geospatial modeling,
Bayesian optimization, or in latent Gaussian models. Within a system, the
Gaussian process model needs to perform in a stable and reliable manner to
ensure it interacts correctly with other parts of the system. In this work, we
study the numerical stability of scalable sparse approximations based on
inducing points. We derive sufficient and in certain cases necessary
conditions on the inducing points for the computations performed to be
numerically stable. For low-dimensional tasks such as geospatial modeling, we
propose an automated method for computing inducing points satisfying these
conditions. This is done via a modification of the cover tree data structure,
which is of independent interest. We additionally propose an alternative
sparse approximation for regression with a Gaussian likelihood which trades
off a small amount of performance to further improve stability. We evaluate
the proposed techniques on a number of examples, showing that, in geospatial
settings, sparse approximations with guaranteed numerical stability often
perform comparably to those without.

Kai Xu, Tor Erlend Fjelde, Charles Sutton, and Hong Ge.
**Couplings for
multinomial Hamiltonian Monte Carlo**.
130:3646-3654, 13-15 Apr 2021.

** Abstract:** Hamiltonian
Monte Carlo (HMC) is a popular sampling method in Bayesian inference.
Recently, Heng & Jacob (2019) studied Metropolis HMC with couplings for
unbiased Monte Carlo estimation, establishing a generic parallelizable scheme
for HMC. However, in practice a different HMC method, multinomial HMC, is
considered as the go-to method, e.g. as part of the no-U-turn sampler. In
multinomial HMC, proposed states are not limited to end-points as in
Metropolis HMC; instead points along the entire trajectory can be proposed.
In this paper, we establish couplings for multinomial HMC, based on optimal
transport for multinomial sampling in its transition. We prove an upper bound
for the meeting time – the time it takes for the coupled chains to meet –
based on the notion of local contractivity. We evaluate our methods using
three targets: 1,000 dimensional Gaussians, logistic regression and
log-Gaussian Cox point processes. Compared to Heng & Jacob (2019),
coupled multinomial HMC generally attains a smaller meeting time, and is more
robust to choices of step sizes and trajectory lengths, which allows re-use
of existing adaptation methods for HMC. These improvements together pave the
way for a wider and more practical use of coupled HMC methods.

Jan M Brauner, Sören Mindermann, Mrinank Sharma, David Johnston, John
Salvatier, Tomáš Gavenčiak, Anna B Stephenson, Gavin Leech,
George Altman, Vladimir Mikulik, Alexander John Norman, Joshua Teperowski
Monrad, Tamay Besiroglu, Hong Ge, Meghan A Hartwick, Yee Whye Teh, Leonid
Chindelevitch, Yarin Gal, and Jan Kulveit.
**Inferring the
effectiveness of government interventions against COVID-19**.
*Science*, December 2020.

Martin Trapp, Robert Peharz, Hong Ge, Franz Pernkopf, and Zoubin Ghahramani.
**Bayesian
learning of sum-product networks**.
In *Advances in Neural Information Processing Systems 33*, Vancouver,
December 2019.

** Abstract:** Sum-product networks (SPNs) are
flexible density estimators and have received significant attention due to
their attractive inference properties. While parameter learning in SPNs is
well developed, structure learning leaves something to be desired: Even
though there is a plethora of SPN structure learners, most of them are
somewhat ad-hoc and based on intuition rather than a clear learning
principle. In this paper, we introduce a well-principled Bayesian framework
for SPN structure learning. First, we decompose the problem into i) laying
out a computational graph, and ii) learning the so-called scope function over
the graph. The first is rather unproblematic and akin to neural network
architecture validation. The second represents the effective structure of the
SPN and needs to respect the usual structural constraints in SPN, i.e.
completeness and decomposability. While representing and learning the scope
function is somewhat involved in general, in this paper, we propose a natural
parametrisation for an important and widely used special case of SPNs. These
structural parameters are incorporated into a Bayesian model, such that
simultaneous structure and parameter learning is cast into monolithic
Bayesian posterior inference. In various experiments, our Bayesian SPNs often
improve test likelihoods over greedy SPN learners. Further, since the
Bayesian framework protects against overfitting, we can evaluate
hyper-parameters directly on the Bayesian model score, waiving the need for a
separate validation set, which is especially beneficial in low data regimes.
Bayesian SPNs can be applied to heterogeneous domains and can easily be
extended to nonparametric formulations. Moreover, our Bayesian approach is
the first, which consistently and robustly learns SPN structures under
missing data.

Hong Ge, Kai Xu, and Zoubin Ghahramani.
**Turing: A language
for flexible probabilistic inference**.
84:1682-1690, 09-11 Apr 2018.

** Abstract:** Probabilistic
programming promises to simplify and democratize probabilistic machine
learning, but successful probabilistic programming systems require flexible,
generic and efficient inference engines. In this work, we present a system
called Turing for building MCMC algorithms for probabilistic programming
inference. Turing has a very simple syntax and makes full use of the
numerical capabilities in the Julia programming language, including all
implemented probability distributions, and automatic differentiation. Turing
supports a wide range of popular Monte Carlo algorithms, including
Hamiltonian Monte Carlo (HMC), HMC with No-U-Turns (NUTS), Gibbs sampling,
sequential Monte Carlo (SMC), and several particle MCMC (PMCMC) samplers.
Most importantly, Turing inference is composable: it combines MCMC operations
on subsets of variables, for example using a combination of an HMC engine and
a particle Gibbs (PG) engine. We explore several combinations of inference
methods with the aim of finding approaches that are both efficient and
universal, i.e. applicable to arbitrary probabilistic models. NUTS, a
popular variant of HMC that adapts the Hamiltonian simulation path length
automatically, is quite powerful for exploring differentiable target
distributions but is not universal. We identify some failure modes for
the NUTS engine, and demonstrate that composition of PG (for discrete
variables) and NUTS (for continuous variables) can be useful when the NUTS
engine is either not applicable, or simply does not work well. Our aim is to
present Turing and its composable inference engines to the world and
encourage other researchers to build on this system to help advance the field
of probabilistic machine learning.

Ben Bloem-Reddy, Emile Mathieu, Adam Foster, Tom Rainforth, Hong Ge, Maria
Lomeli, and Zoubin Ghahramani.
**Sampling and inference
for discrete random probability measures in probabilistic programs**.
In *NIPS workshop on Advances in Approximate Inference*,
California, United States, December 2017.

** Abstract:** We
consider the problem of sampling a sequence from a discrete random
probability measure (RPM) with countable support, under (probabilistic)
constraints of finite memory and computation. A canonical example is sampling
from the Dirichlet Process, which can be accomplished using its well-known
stick-breaking representation and lazy initialization of its atoms. We show
that efficient lazy initialization is possible if and only if a size-biased
representation of the discrete RPM is known. For models constructed from such
discrete RPMs, we consider the implications for generic particle-based
inference methods in probabilistic programming systems. To demonstrate, we
implement posterior inference for Normalized Inverse Gaussian Process mixture
models in Turing.
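
The lazy initialization mentioned in the abstract can be illustrated with a small sketch for the Dirichlet Process case only. It is not the paper's implementation: the class name `LazyStickBreakingDP`, its parameters, and the standard normal base measure are all assumptions made for illustration. Atoms and stick weights are created only when a draw actually needs them, so memory grows with the number of atoms visited rather than with the (infinite) support.

```python
import random


class LazyStickBreakingDP:
    """Hypothetical helper: draws from G ~ DP(alpha, base), creating atoms on demand."""

    def __init__(self, alpha, base_sampler, rng=None):
        self.alpha = alpha
        self.base_sampler = base_sampler  # callable rng -> atom location
        self.rng = rng or random.Random()
        self.weights = []      # stick weights instantiated so far
        self.atoms = []        # atom locations instantiated so far
        self.remaining = 1.0   # mass of the stick that has not been broken yet

    def _break_stick(self):
        # Stick-breaking: a Beta(1, alpha) fraction of the remaining stick.
        beta = self.rng.betavariate(1.0, self.alpha)
        self.weights.append(self.remaining * beta)
        self.atoms.append(self.base_sampler(self.rng))
        self.remaining *= 1.0 - beta

    def sample(self):
        # Inverse-CDF draw over the atoms, extending the stick only if needed.
        u = self.rng.random()
        cumulative = 0.0
        k = 0
        while True:
            if k == len(self.weights):
                if self.weights and self.remaining <= 0.0:
                    return self.atoms[-1]  # stick numerically exhausted
                self._break_stick()
            cumulative += self.weights[k]
            if u < cumulative:
                return self.atoms[k]
            k += 1


dp = LazyStickBreakingDP(
    alpha=1.0,
    base_sampler=lambda rng: rng.gauss(0.0, 1.0),  # assumed base measure
    rng=random.Random(0),
)
draws = [dp.sample() for _ in range(200)]
print(len(set(draws)), "distinct atoms among", len(draws), "draws")
print(len(dp.atoms), "atoms instantiated in total")
```

Because the stick-breaking weights of the Dirichlet Process arrive in size-biased order, most draws fall on the first few atoms, so few new atoms need to be instantiated.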

Nilesh Tripuraneni, Shixiang Gu, Hong Ge, and Zoubin Ghahramani.
**Particle
Gibbs for infinite hidden Markov models**.
In *Advances in Neural Information Processing Systems 29*, Montreal,
Canada, May 2015.

** Abstract:** Infinite Hidden Markov Models
(iHMMs) are an attractive, nonparametric generalization of the classical
Hidden Markov Model which can automatically infer the number of hidden states
in the system. However, due to the infinite-dimensional nature of the
transition dynamics, performing inference in the iHMM is difficult. In this
paper, we present an infinite-state Particle Gibbs (PG) algorithm to
resample state trajectories for the iHMM. The proposed algorithm uses an
efficient proposal optimized for iHMMs and leverages ancestor sampling to
improve the mixing of the standard PG algorithm. Our algorithm demonstrates
significant convergence improvements on synthetic and real world data
sets.

Hong Ge, Yutian Chen, Moquan Wan, and Zoubin Ghahramani.
**Distributed
Inference for Dirichlet Process Mixture Models**.
In *32nd International Conference on Machine Learning*, PMLR 37:2276-2284, 07-09 Jul 2015.

** Abstract:** Bayesian
nonparametric mixture models based on the Dirichlet process (DP) have been
widely used for solving problems like clustering, density estimation and
topic modelling. These models make weak assumptions about the underlying
process that generated the observed data. Thus, when more data are collected,
the complexity of these models can change accordingly. These theoretical
properties often lead to superior predictive performance when compared to
traditional finite mixture models. However, despite the increasing amount of
data available, the application of Bayesian nonparametric mixture models is
so far limited to relatively small data sets. In this paper, we propose an
efficient distributed inference algorithm for the DP and the HDP mixture
model. The proposed method is based on a variant of the slice sampler for
DPs. Since this sampler does not involve a pre-determined truncation, the
stationary distribution of the sampling algorithm is unbiased. We provide
both local thread-level and distributed machine-level parallel
implementations and study the performance of this sampler through an
extensive set of experiments on image and text data. When compared to
existing inference algorithms, the proposed method exhibits state-of-the-art
accuracy and strong scalability with up to 512 cores.

Francesco Iorio, Timothy Rittman, Hong Ge, Michael Menden, and Julio
Saez-Rodriguez.
**Transcriptional data: a
new gateway to drug repositioning?**
*Drug Discovery Today*, 18(7-8):350-357, April 2013.

Vincent Dutordoir, Alan Saul, Zoubin Ghahramani, and Fergus Simpson.
**Neural diffusion
processes**.
In *arXiv*, Online, Apr 2022.

** Abstract:** Gaussian
processes provide an elegant framework for specifying prior and posterior
distributions over functions. They are, however, also computationally
expensive, and limited by the expressivity of their covariance function. We
propose Neural Diffusion Processes (NDPs), a novel approach based upon
diffusion models, that learn to sample from distributions over functions.
Using a novel attention block, we can incorporate properties of stochastic
processes, such as exchangeability, directly into the NDP's architecture. We
empirically show that NDPs are able to capture functional distributions that
are close to the true Bayesian posterior of a Gaussian process. This enables
a variety of downstream tasks, including hyperparameter marginalisation and
Bayesian optimisation.

Vincent Dutordoir, James Hensman, Mark van der Wilk, Carl Henrik Ek, Zoubin
Ghahramani, and Nicolas Durrande.
**Deep
neural networks as point estimates for deep Gaussian processes**.
In *Advances in Neural Information Processing Systems 34*, Online, Dec
2021.

** Abstract:** Neural networks and Gaussian processes
are complementary in their strengths and weaknesses. Having a better
understanding of their relationship comes with the promise to make each
method benefit from the strengths of the other. In this work, we establish an
equivalence between the forward passes of neural networks and (deep) sparse
Gaussian process models. The theory we develop is based on interpreting
activation functions as interdomain inducing features through a rigorous
analysis of the interplay between activation functions and kernels. This
results in models that can either be seen as neural networks with improved
uncertainty prediction or deep Gaussian processes with increased prediction
accuracy. These claims are supported by experimental results on regression
and classification datasets.

Robert Peharz, Steven Lang, Antonio Vergari, Karl Stelzner, Alejandro Molina,
Martin Trapp, Guy Van den Broeck, Kristian Kersting, and Zoubin Ghahramani.
**Einsum networks: Fast and
scalable learning of tractable probabilistic circuits**.
In *37th International Conference on Machine Learning*, Online, July
2020.

** Abstract:** Probabilistic circuits (PCs) are a
promising avenue for probabilistic modeling, as they permit a wide range of
exact and efficient inference routines. Recent "deep-learning-style"
implementations of PCs strive for better scalability, but are still
difficult to train on real-world data, due to their sparsely connected
computational graphs. In this paper, we propose Einsum Networks (EiNets), a
novel implementation design for PCs, improving prior art in several regards.
At their core, EiNets combine a large number of arithmetic operations in a
single monolithic einsum-operation, leading to speedups and memory savings of
up to two orders of magnitude, in comparison to previous implementations. As
an algorithmic contribution, we show that the implementation of
Expectation-Maximization (EM) can be simplified for PCs, by leveraging
automatic differentiation. Furthermore, we demonstrate that EiNets scale well
to datasets which were previously out of reach, such as SVHN and CelebA, and
that they can be used as faithful generative image models.
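
To make the idea of folding many sum and product operations into one einsum call concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the EiNet code: the function name `einsum_layer`, the tensor shapes, and the weight layout `(K_out, K, K)` are hypothetical. Each output is a weighted sum over all pairwise products of two child vectors, computed in the log domain by subtracting the per-row maximum before exponentiating.

```python
import numpy as np


def einsum_layer(log_left, log_right, weights):
    """Mix all pairwise products of two child vectors in one einsum call.

    log_left, log_right: (batch, K) log-densities of the two children.
    weights: (K_out, K, K), non-negative, each weights[o] summing to 1.
    Returns (batch, K_out) log-densities.
    """
    # Subtract per-row maxima so the exponentials stay in a safe range.
    a = log_left.max(axis=1, keepdims=True)
    b = log_right.max(axis=1, keepdims=True)
    left = np.exp(log_left - a)
    right = np.exp(log_right - b)
    mixed = np.einsum("oij,bi,bj->bo", weights, left, right)
    return np.log(mixed) + a + b


rng = np.random.default_rng(0)
batch, k, k_out = 4, 3, 2
log_left = rng.normal(size=(batch, k)) - 5.0
log_right = rng.normal(size=(batch, k)) - 5.0
weights = rng.random((k_out, k, k))
weights /= weights.sum(axis=(1, 2), keepdims=True)

fast = einsum_layer(log_left, log_right, weights)

# Reference: explicit loops over batch entries and output units.
naive = np.empty((batch, k_out))
for n in range(batch):
    joint = np.exp(log_left[n][:, None] + log_right[n][None, :])
    for o in range(k_out):
        naive[n, o] = np.log((weights[o] * joint).sum())

print(np.abs(fast - naive).max())
```

The loop version and the einsum version compute the same quantity; the single einsum call is the part that maps well onto batched tensor hardware.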

Martin Trapp, Robert Peharz, Hong Ge, Franz Pernkopf, and Zoubin Ghahramani.
**Bayesian
learning of sum-product networks**.
In *Advances in Neural Information Processing Systems 33*, Vancouver,
December 2019.

** Abstract:** Sum-product networks (SPNs) are
flexible density estimators and have received significant attention due to
their attractive inference properties. While parameter learning in SPNs is
well developed, structure learning leaves something to be desired: Even
though there is a plethora of SPN structure learners, most of them are
somewhat ad-hoc and based on intuition rather than a clear learning
principle. In this paper, we introduce a well-principled Bayesian framework
for SPN structure learning. First, we decompose the problem into i) laying
out a computational graph, and ii) learning the so-called scope function over
the graph. The first is rather unproblematic and akin to neural network
architecture validation. The second represents the effective structure of the
SPN and needs to respect the usual structural constraints in SPN, i.e.
completeness and decomposability. While representing and learning the scope
function is somewhat involved in general, in this paper, we propose a natural
parametrisation for an important and widely used special case of SPNs. These
structural parameters are incorporated into a Bayesian model, such that
simultaneous structure and parameter learning is cast into monolithic
Bayesian posterior inference. In various experiments, our Bayesian SPNs often
improve test likelihoods over greedy SPN learners. Further, since the
Bayesian framework protects against overfitting, we can evaluate
hyper-parameters directly on the Bayesian model score, waiving the need for a
separate validation set, which is especially beneficial in low data regimes.
Bayesian SPNs can be applied to heterogeneous domains and can easily be
extended to nonparametric formulations. Moreover, our Bayesian approach is
the first that consistently and robustly learns SPN structures under
missing data.

Tameem Adel, Isabel Valera, Zoubin Ghahramani, and Adrian Weller.
**One-network
adversarial fairness**.
In *33rd AAAI Conference on Artificial Intelligence*, Hawaii, January
2019.

** Abstract:** There is currently a great expansion of
the impact of machine learning algorithms on our lives, prompting the need
for objectives other than pure performance, including fairness. Fairness here
means that the outcome of an automated decision-making system should not
discriminate between subgroups characterized by sensitive attributes such as
gender or race. Given any existing differentiable classifier, we make only
slight adjustments to the architecture including adding a new hidden layer,
in order to enable the concurrent adversarial optimization for fairness and
accuracy. Our framework provides one way to quantify the tradeoff between
fairness and accuracy, while also leading to strong empirical
performance.

Tameem Adel, Zoubin Ghahramani, and Adrian Weller.
**Discovering
interpretable representations for both deep generative and discriminative
models**.
In *35th International Conference on Machine Learning*, Stockholm,
Sweden, July 2018.

** Abstract:** Interpretability of
representations in both deep generative and discriminative models is highly
desirable. Current methods jointly optimize an objective combining accuracy
and interpretability. However, this may reduce accuracy, and is not
applicable to already trained models. We propose two interpretability
frameworks. First, we provide an interpretable lens for an existing model. We
use a generative model which takes as input the representation in an existing
(generative or discriminative) model, weakly supervised by limited side
information. Applying a flexible and invertible transformation to the input
leads to an interpretable representation with no loss in accuracy. We extend
the approach using an active learning strategy to choose the most useful side
information to obtain, allowing a human to guide what "interpretable" means.
Our second framework relies on joint optimization for a representation which
is both maximally informative about the side information and maximally
compressive about the non-interpretable data factors. This leads to a novel
perspective on the relationship between compression and regularization. We
also propose a new interpretability evaluation metric based on our framework.
Empirically, we achieve state-of-the-art results on three datasets using the
two proposed algorithms.

Jiri Hron, Alexander G. D. G. Matthews, and Zoubin Ghahramani.
**Variational Bayesian dropout:
pitfalls and fixes**.
*ICML*, 2018.

** Abstract:** Dropout, a stochastic
regularisation technique for training of neural networks, has recently been
reinterpreted as a specific type of approximate inference algorithm for
Bayesian neural networks. The main contribution of the reinterpretation is in
providing a theoretical framework useful for analysing and extending the
algorithm. We show that the proposed framework suffers from several issues,
ranging from undefined or pathological behaviour of the true posterior related
to use of improper priors to an ill-defined variational objective due to
singularity of the approximating distribution relative to the true posterior.
Our analysis of the improper log uniform prior used in variational Gaussian
dropout suggests the pathologies are generally irredeemable, and that the
algorithm still works only because the variational formulation annuls some of
the pathologies. To address the singularity issue, we proffer Quasi-KL (QKL)
divergence, a new approximate inference objective for approximation of
high-dimensional distributions. We show that motivations for variational
Bernoulli dropout based on discretisation and noise have QKL as a limit.
Properties of QKL are studied both theoretically and on a simple practical
example which shows that the QKL-optimal approximation of a full rank
Gaussian with a degenerate one naturally leads to the Principal Component
Analysis solution.

Maria Lomeli, Mark Rowland, Arthur Gretton, and Zoubin Ghahramani.
**Antithetic and Monte Carlo
kernel estimators for partial rankings**.
*arXiv preprint arXiv:1807.00400*, 2018.

** Abstract:**
In the modern age, rankings data is ubiquitous and it is useful for a variety
of applications such as recommender systems, multi-object tracking and
preference learning. However, most rankings data encountered in the real
world is incomplete, which prevents the direct application of existing
modelling tools for complete rankings. Our contribution is a novel way to
extend kernel methods for complete rankings to partial rankings, via
consistent Monte Carlo estimators for Gram matrices: matrices of kernel
values between pairs of observations. We also present a novel variance
reduction scheme based on an antithetic variate construction between
permutations to obtain an improved estimator for the Mallows kernel. The
corresponding antithetic kernel estimator has lower variance and we
demonstrate empirically that it has a better performance in a variety of
Machine Learning tasks. Both kernel estimators are based on extending kernel
mean embeddings to the embedding of a set of full rankings consistent with an
observed partial ranking. They form a computationally tractable alternative
to previous approaches for partial rankings data. An overview of the existing
kernels and metrics for permutations is also provided.
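
The plain Monte Carlo estimator described above can be sketched for top-k partial rankings. This is an illustrative sketch, not the authors' code: the function names, the value `lam=0.3`, and the unnormalised Mallows form `exp(-lam * d)` with `d` the Kendall distance are assumptions. The kernel between two partial rankings is taken as the average Mallows kernel over pairs of full rankings consistent with each observation, and the estimator replaces that average with uniform samples. The antithetic variant is not shown.

```python
import itertools
import math
import random


def kendall_distance(a, b):
    """Number of item pairs ordered differently by full rankings a and b."""
    pos_b = {item: rank for rank, item in enumerate(b)}
    seq = [pos_b[item] for item in a]
    return sum(
        1
        for i in range(len(seq))
        for j in range(i + 1, len(seq))
        if seq[i] > seq[j]
    )


def mallows_kernel(a, b, lam=0.3):
    return math.exp(-lam * kendall_distance(a, b))


def sample_consistent(top, items, rng):
    """Uniform full ranking that starts with the observed top-k list."""
    rest = [x for x in items if x not in top]
    rng.shuffle(rest)
    return tuple(top) + tuple(rest)


def mc_kernel(top_a, top_b, items, n_samples, rng, lam=0.3):
    """Monte Carlo estimate of the kernel between two top-k partial rankings."""
    total = 0.0
    for _ in range(n_samples):
        a = sample_consistent(top_a, items, rng)
        b = sample_consistent(top_b, items, rng)
        total += mallows_kernel(a, b, lam)
    return total / n_samples


def exact_kernel(top_a, top_b, items, lam=0.3):
    """Exact average over all consistent pairs (feasible only for few items)."""
    def consistent(top):
        rest = [x for x in items if x not in top]
        return [tuple(top) + p for p in itertools.permutations(rest)]

    set_a, set_b = consistent(top_a), consistent(top_b)
    total = sum(mallows_kernel(a, b, lam) for a in set_a for b in set_b)
    return total / (len(set_a) * len(set_b))


items = list(range(6))
top_a, top_b = (0, 1), (1, 2)
estimate = mc_kernel(top_a, top_b, items, n_samples=20000, rng=random.Random(0))
exact = exact_kernel(top_a, top_b, items)
print(estimate, exact)
```

The exact average enumerates every consistent pair and is only feasible for a handful of items, which is why a sampling-based estimate is attractive for larger item sets.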

Alexander G. D. G. Matthews, Jiri Hron, Mark Rowland, Richard E. Turner, and
Zoubin Ghahramani.
**Gaussian process behaviour in
wide deep neural networks**.
*ICLR*, 2018.

** Abstract:** Whilst deep neural networks
have shown great empirical success, there is still much work to be done to
understand their theoretical properties. In this paper, we study the
relationship between random, wide, fully connected, feedforward networks with
more than one hidden layer and Gaussian processes with a recursive kernel
definition. We show that, under broad conditions, as we make the architecture
increasingly wide, the implied random function converges in distribution to a
Gaussian process, formalising and extending existing results by Neal (1996)
to deep networks. To evaluate convergence rates empirically, we use maximum
mean discrepancy. We then compare finite Bayesian deep networks from the
literature to Gaussian processes in terms of the key predictive quantities of
interest, finding that in some cases the agreement can be very close. We
discuss the desirability of Gaussian process behaviour and review
non-Gaussian alternative models from the literature.

Adam Ścibior, Ohad Kammar, and Zoubin Ghahramani.
**Functional programming
for modular Bayesian inference**.
*Proceedings of the ACM on Programming Languages*, 2, 2018.

** Abstract:** We present an architectural design of a library
for Bayesian modelling and inference in modern functional programming
languages. The novel aspect of our approach is the modular implementation of
existing state-of-the-art inference algorithms. Our design relies on three
inherently functional features: higher-order functions, inductive data-types,
and support for either type-classes or an expressive module system. We
provide a performant Haskell implementation of this architecture,
demonstrating that high-level and modular probabilistic programming can be
added as a library in sufficiently expressive languages. We review the core
abstractions in this architecture: inference representations, inference
transformations, and inference representation transformers. We then implement
concrete instances of these abstractions, counterparts to particle filters
and Metropolis-Hastings samplers, which form the basic building blocks of our
library. By composing these building blocks we obtain state-of-the-art
inference algorithms: Resample-Move Sequential Monte Carlo, Particle Marginal
Metropolis-Hastings, and Sequential Monte Carlo Squared. We evaluate our
implementation against existing probabilistic programming systems and find it
is already competitively performant, although we conjecture that existing
functional programming optimisation techniques could reduce the overhead
associated with the abstractions we use. We show that our modular design
enables deterministic testing of inherently stochastic Monte Carlo
algorithms. Finally, we demonstrate using OCaml that an expressive module
system can also implement our design.

Adam Ścibior, Ohad Kammar, Matthijs Vákár, Sam Staton, Hongseok Yang,
Yufei Cai, Klaus Ostermann, Sean K. Moss, Chris Heunen, and Zoubin
Ghahramani.
**Denotational
validation of higher-order Bayesian inference**.
*Proceedings of the ACM on Programming Languages*, 2, 2018.

** Abstract:** We present a modular semantic account of Bayesian
inference algorithms for probabilistic programming languages, as used in data
science and machine learning. Sophisticated inference algorithms are often
explained in terms of composition of smaller parts. However, neither their
theoretical justification nor their implementation reflects this modularity.
We show how to conceptualise and analyse such inference algorithms as
manipulating intermediate representations of probabilistic programs using
higher-order functions and inductive types, and their denotational semantics.
Semantic accounts of continuous distributions use measurable spaces. However,
our use of higher-order functions presents a substantial technical
difficulty: it is impossible to define a measurable space structure over the
collection of measurable functions between arbitrary measurable spaces that
is compatible with standard operations on those functions, such as function
application. We overcome this difficulty using quasi-Borel spaces, a recently
proposed mathematical structure that supports both function spaces and
continuous distributions. We define a class of semantic structures for
representing probabilistic programs, and semantic validity criteria for
transformations of these representations in terms of distribution
preservation. We develop a collection of building blocks for composing
representations. We use these building blocks to validate common inference
algorithms such as Sequential Monte Carlo and Markov Chain Monte Carlo. To
emphasize the connection between the semantic manipulation and its
traditional measure theoretic origins, we use Kock’s synthetic measure
theory. We demonstrate its usefulness by proving a quasi-Borel counterpart to
the Metropolis-Hastings-Green theorem.

Shixiang Gu, Timothy Lillicrap, Zoubin Ghahramani, Richard E. Turner, Bernhard
Schölkopf, and Sergey Levine.
**Interpolated policy gradient:
Merging on-policy and off-policy policy gradient estimation for deep
reinforcement learning**.
In *Advances in Neural Information Processing Systems 31*, Long Beach,
USA, Dec 2017.

** Abstract:** Off-policy model-free deep
reinforcement learning methods using previously collected data can improve
sample efficiency over on-policy policy gradient techniques. On the other
hand, on-policy algorithms are often more stable and easier to use. This
paper examines, both theoretically and empirically, approaches to merging on-
and off-policy updates for deep reinforcement learning. Theoretical results
show that off-policy updates with a value function estimator can be
interpolated with on-policy policy gradient updates whilst still satisfying
performance bounds. Our analysis uses control variate methods to produce a
family of policy gradient algorithms, with several recently proposed
algorithms being special cases of this family. We then provide an empirical
comparison of these techniques with the remaining algorithmic details fixed,
and show how different mixing of off-policy gradient estimates with on-policy
samples contributes to improvements in empirical performance. The final
algorithm provides a generalization and unification of existing deep policy
gradient techniques, has theoretical guarantees on the bias introduced by
off-policy updates, and improves on the state-of-the-art model-free deep RL
methods on a number of OpenAI Gym continuous control benchmarks.

Matej Balog, Nilesh Tripuraneni, Zoubin Ghahramani, and Adrian Weller.
**Lost
relatives of the Gumbel trick**.
In *34th International Conference on Machine Learning*, Sydney,
Australia, August 2017.

** Abstract:** The Gumbel trick is a
method to sample from a discrete probability distribution, or to estimate its
normalizing partition function. The method relies on repeatedly applying a
random perturbation to the distribution in a particular way, each time
solving for the most likely configuration. We derive an entire family of
related methods, of which the Gumbel trick is one member, and show that the
new methods have superior properties in several settings with minimal
additional computational cost. In particular, for the Gumbel trick to yield
computational benefits for discrete graphical models, Gumbel perturbations on
all configurations are typically replaced with so-called low-rank
perturbations. We show how a subfamily of our new methods adapts to this
setting, proving new upper and lower bounds on the log partition function and
deriving a family of sequential samplers for the Gibbs distribution. Finally,
we balance the discussion by showing how the simpler analytical form of the
Gumbel trick enables additional theoretical results.
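
For readers unfamiliar with the basic Gumbel trick that this family generalises, a minimal NumPy sketch follows. The potentials `phi` and the sample count are arbitrary illustrative values, and only the standard trick is shown, not the new family or the low-rank perturbations. Adding independent standard Gumbel noise to each unnormalised log-probability and taking the argmax yields an exact sample, and the mean of the perturbed maximum equals the log partition function plus the Euler-Mascheroni constant.

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalised log-probabilities (potentials) of a small discrete distribution.
phi = np.array([1.0, 0.5, -0.3, 2.0])
log_z = np.log(np.exp(phi).sum())
probs = np.exp(phi - log_z)

n = 200_000
gumbels = rng.gumbel(size=(n, phi.size))  # standard Gumbel(0, 1) noise
perturbed = phi + gumbels

# Sampling: the argmax of the perturbed potentials follows softmax(phi).
samples = perturbed.argmax(axis=1)
freq = np.bincount(samples, minlength=phi.size) / n

# Partition function: E[max_i (phi_i + G_i)] = log Z + Euler-Mascheroni constant.
log_z_hat = perturbed.max(axis=1).mean() - np.euler_gamma

print(freq, probs)
print(log_z_hat, log_z)
```

Both estimates require only maximisations over perturbed potentials, which is the property the abstract refers to as solving for the most likely configuration.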

Shixiang Gu, Timothy Lillicrap, Zoubin Ghahramani, Richard E. Turner, and
Sergey Levine.
**Q-prop: Sample-efficient policy
gradient with an off-policy critic**.
In *5th International Conference on Learning Representations*, Toulon,
France, April 2017.

** Abstract:** Model-free deep
reinforcement learning (RL) methods have been successful in a wide variety of
simulated domains. However, a major obstacle facing deep RL in the real world
is its high sample complexity. Batch policy gradient methods offer stable
learning, but at the cost of high variance, which often requires large
batches. TD-style methods, such as off-policy actor-critic and Q-learning,
are more sample-efficient but biased, and often require costly hyperparameter
sweeps to stabilize. In this work, we aim to develop methods that combine the
stability of policy gradients with the efficiency of off-policy RL. We
present Q-Prop, a policy gradient method that uses a Taylor expansion of the
off-policy critic as a control variate. Q-Prop is both sample efficient and
stable, and effectively combines the benefits of on-policy and off-policy
methods. We analyze the connection between Q-Prop and existing model-free
algorithms, and use control variate theory to derive two variants of Q-Prop
with conservative and aggressive adaptation. We show that conservative Q-Prop
provides substantial gains in sample efficiency over trust region policy
optimization (TRPO) with generalized advantage estimation (GAE), and improves
stability over deep deterministic policy gradient (DDPG), the
state-of-the-art on-policy and off-policy methods, on OpenAI Gym's MuJoCo
continuous control environments.

Alexander G. D. G. Matthews, Jiri Hron, Richard E. Turner, and Zoubin
Ghahramani.
**Sample-then-optimise
posterior sampling for Bayesian linear models**.
*AABI (NeurIPS workshop)*, 2017.

** Abstract:** In modern
machine learning it is common to train models which have an extremely high
intrinsic capacity. The results obtained are often initialization dependent,
are different for disparate optimizers and in some cases have no explicit
regularization. This raises difficult questions about generalization. A
natural approach to questions of generalization is a Bayesian one. There is
therefore a growing literature attempting to understand how Bayesian
posterior inference could emerge from the complexity of modern practice, even
without having such a procedure as the stated goal. In this work we consider
a simple special case where exact Bayesian posterior sampling emerges from
sampling (cf. initialization) and then gradient descent. Specifically, for a
Bayesian linear model, if we parameterize it in terms of a deterministic
function of an isotropic normal prior, then the action of sampling from the
prior followed by first order optimization of the squared loss will give a
posterior sample. Although the assumptions are stronger than many real
problems, it still exhibits the challenging properties of redundant model
capacity and a lack of explicit regularizers, along with initialization and
optimizer dependence. It is therefore an interesting controlled test case.
Given its simplicity, the method itself may turn out to be of independent
interest from our original goal.
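
The claim can be checked numerically on a toy problem. The sketch below is one possible construction and not necessarily the paper's exact setup: it assumes the observation noise is folded into the parameter vector, so that `z = (weights, noise)` has an isotropic normal prior and the data satisfy `y = A z` with `A = [Phi, sigma * I]`. The sizes, `sigma`, and all variable names are illustrative. Gradient descent on the squared loss, started from a prior draw, should then land on a draw from the posterior, which is compared against the closed-form projection and the textbook posterior moments of the weights.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 5, 3, 0.5
Phi = rng.normal(size=(n, d))   # design matrix (illustrative data)
y = rng.normal(size=n)          # targets (illustrative data)

# Assumed reparameterisation: z = (weights, scaled noise) ~ N(0, I), y = A z.
A = np.hstack([Phi, sigma * np.eye(n)])
K = A @ A.T


def sample_then_optimise(z0, steps=5000):
    """Gradient descent on 0.5 * ||A z - y||^2 starting from a prior draw z0."""
    step = 1.0 / np.linalg.eigvalsh(K).max()
    z = z0.copy()
    for _ in range(steps):
        z -= step * A.T @ (A @ z - y)
    return z


# One prior draw, optimised, versus the closed-form limit of gradient descent.
z0 = rng.normal(size=d + n)
z_gd = sample_then_optimise(z0)
z_closed = z0 + A.T @ np.linalg.solve(K, y - A @ z0)

# Many draws via the closed form, compared with the standard posterior of w.
Z0 = rng.normal(size=(20000, d + n))
Z = Z0 + np.linalg.solve(K, (y - Z0 @ A.T).T).T @ A
w_samples = Z[:, :d]
post_cov = np.linalg.inv(Phi.T @ Phi / sigma**2 + np.eye(d))
post_mean = post_cov @ Phi.T @ y / sigma**2

print(np.abs(z_gd - z_closed).max())
print(w_samples.mean(axis=0), post_mean)
```

Gradient descent never leaves the affine subspace through the starting point spanned by the rows of `A`, so the solution it reaches depends on the initial draw; that dependence is what carries the prior randomness into the posterior sample.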

Nilesh Tripuraneni, Mark Rowland, Zoubin Ghahramani, and Richard E. Turner.
**Magnetic
Hamiltonian Monte Carlo**.
In *34th International Conference on Machine Learning*, 2017.

** Abstract:** Hamiltonian Monte Carlo (HMC) exploits
Hamiltonian dynamics to construct efficient proposals for Markov chain Monte
Carlo (MCMC). In this paper, we present a generalization of HMC which
exploits non-canonical Hamiltonian dynamics. We refer to this algorithm as
magnetic HMC, since in 3 dimensions a subset of the dynamics map onto the
mechanics of a charged particle coupled to a magnetic field. We establish a
theoretical basis for the use of non-canonical Hamiltonian dynamics in MCMC,
and construct a symplectic, leapfrog-like integrator allowing for the
implementation of magnetic HMC. Finally, we exhibit several examples where
these non-canonical dynamics can lead to improved mixing of magnetic HMC
relative to ordinary HMC.

Matej Balog, Balaji Lakshminarayanan, Zoubin Ghahramani, Daniel M. Roy, and
Yee Whye Teh.
**The
Mondrian kernel**.
In *32nd Conference on Uncertainty in Artificial Intelligence*, pages
32-41, Jersey City, New Jersey, USA, June 2016.

** Abstract:** We introduce the Mondrian kernel, a fast random feature
approximation to the Laplace kernel. It is suitable for both batch and online
learning, and admits a fast kernel-width-selection procedure as the random
features can be re-used efficiently for all kernel widths. The features are
constructed by sampling trees via a Mondrian process [Roy and Teh, 2009], and
we highlight the connection to Mondrian forests [Lakshminarayanan et al.,
2014], where trees are also sampled via a Mondrian process, but fit
independently. This link provides a new insight into the relationship between
kernel methods and random forests.

Jes Frellsen, Ole Winther, Zoubin Ghahramani, and Jesper Ferkinghoff-Borg.
**Bayesian generalised ensemble Markov chain Monte Carlo**.
In *19th International Conference on Artificial Intelligence and
Statistics*, Cadiz, Spain, May 2016.

** Abstract:**
Bayesian generalised ensemble (BayesGE) is a new method that addresses two
major drawbacks of standard Markov chain Monte Carlo algorithms for inference
in high-dimensional probability models: inapplicability to estimate the
partition function, and poor mixing properties. BayesGE uses a Bayesian
approach to iteratively update the belief about the density of states
(distribution of the log likelihood under the prior) for the model, with the
dual purpose of enhancing the sampling efficiency and making the estimation of
the partition function tractable. We benchmark BayesGE on Ising and Potts
systems and show that it compares favourably to existing state-of-the-art
methods.

Alexander G D G Matthews, James Hensman, Richard E. Turner, and Zoubin
Ghahramani.
**On Sparse Variational methods
and the Kullback-Leibler divergence between stochastic processes**.
In *19th International Conference on Artificial Intelligence and
Statistics*, Cadiz, Spain, May 2016.

** Abstract:** The
variational framework for learning inducing variables (Titsias, 2009a) has
had a large impact on the Gaussian process literature. The framework may be
interpreted as minimizing a rigorously defined Kullback-Leibler divergence
between the approximating and posterior processes. To our knowledge this
connection has thus far gone unremarked in the literature. In this paper we
give a substantial generalization of the literature on this topic. We give a
new proof of the result for infinite index sets which allows inducing points
that are not data points and likelihoods that depend on all function values.
We then discuss augmented index sets and show that, contrary to previous
works, marginal consistency of augmentation is not enough to guarantee
consistency of variational inference with the original model. We then
characterize an extra condition where such a guarantee is obtainable. Finally
we show how our framework sheds light on interdomain sparse approximations
and sparse approximations for Cox processes.

Shixiang Gu, Zoubin Ghahramani, and Richard E. Turner.
**Neural adaptive sequential Monte
Carlo**.
In *Advances in Neural Information Processing Systems 28*, Montreal,
Canada, December 2015.

** Abstract:** Sequential Monte Carlo (SMC),
or particle filtering, is a popular class of methods for sampling from an
intractable target distribution using a sequence of simpler intermediate
distributions. Like other importance sampling-based methods, performance is
critically dependent on the proposal distribution: a bad proposal can lead to
arbitrarily inaccurate estimates of the target distribution. This paper
presents a new method for automatically adapting the proposal using an
approximation of the Kullback-Leibler divergence between the true posterior
and the proposal distribution. The method is very flexible, applicable to any
parameterised proposal distribution and it supports online and batch
variants. We use the new framework to adapt powerful proposal distributions
with rich parameterisations based upon neural networks leading to Neural
Adaptive Sequential Monte Carlo (NASMC). Experiments indicate that NASMC
significantly improves inference in a non-linear state space model
outperforming adaptive proposal methods including the Extended Kalman and
Unscented Particle Filters. Experiments also indicate that improved inference
translates into improved parameter learning when NASMC is used as a
subroutine of Particle Marginal Metropolis Hastings. Finally we show that
NASMC is able to train a neural network-based deep recurrent generative model
achieving results that compete with the state-of-the-art for polyphonic
music modelling. NASMC can be seen as bridging the gap between adaptive SMC
methods and the recent work in scalable, black-box variational inference.

James Hensman, Alexander G D G Matthews, Maurizio Filippone, and Zoubin
Ghahramani.
**MCMC
for Variationally Sparse Gaussian Processes**.
In *Advances in Neural Information Processing Systems 28*, pages 1-9,
Montreal, Canada, December 2015.

** Abstract:** Gaussian
process (GP) models form a core part of probabilistic machine learning.
Considerable research effort has been made into attacking three issues with
GP models: how to compute efficiently when the number of data is large; how
to approximate the posterior when the likelihood is not Gaussian; and how to
estimate covariance function parameter posteriors. This paper simultaneously
addresses these, using a variational approximation to the posterior which is
sparse in support of the function but otherwise free-form. The result is a
Hybrid Monte-Carlo sampling scheme which allows for a non-Gaussian
approximation over the function values and covariance parameters
simultaneously, with efficient computations based on inducing-point sparse
GPs. Code to replicate each experiment in this paper will be available
shortly.

James Robert Lloyd and Zoubin Ghahramani.
**Statistical
model criticism using kernel two sample tests**.
In *Advances in Neural Information Processing Systems 28*, pages 1-9,
Montreal, Canada, December 2015.

** Abstract:** We propose an
exploratory approach to statistical model criticism using maximum mean
discrepancy (MMD) two sample tests. Typical approaches to model criticism
require a practitioner to select a statistic by which to measure
discrepancies between data and a statistical model. MMD two sample tests are
instead constructed as an analytic maximisation over a large space of
possible statistics and therefore automatically select the statistic which
most shows any discrepancy. We demonstrate on synthetic data that the
selected statistic, called the witness function, can be used to identify
where a statistical model most misrepresents the data it was trained on. We
then apply the procedure to real data where the models being assessed are
restricted Boltzmann machines, deep belief networks and Gaussian process
regression and demonstrate the ways in which these models fail to capture the
properties of the data they are trained on.
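
As a rough illustration of the witness function mentioned in this abstract, the sketch below evaluates the empirical MMD witness on a grid: the mean kernel similarity to the data minus the mean kernel similarity to samples from the model. The Gaussian kernel, the lengthscale, the toy bimodal data and the helper names (`rbf_kernel`, `witness`) are assumptions made for illustration; this is not the authors' code.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    """Gaussian (RBF) kernel matrix between the rows of a and the rows of b."""
    sq_dists = (np.sum(a**2, axis=1)[:, None]
                + np.sum(b**2, axis=1)[None, :]
                - 2.0 * a @ b.T)
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * lengthscale**2))

def witness(points, data, model_samples, lengthscale=1.0):
    """Empirical witness: mean k(x, data) minus mean k(x, model samples)."""
    return (rbf_kernel(points, data, lengthscale).mean(axis=1)
            - rbf_kernel(points, model_samples, lengthscale).mean(axis=1))

# Toy setup (assumed): bimodal data, and a unimodal "model" of similar variance.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2.0, 0.5, 500),
                       rng.normal(2.0, 0.5, 500)])[:, None]
model_samples = rng.normal(0.0, 2.06, 1000)[:, None]

grid = np.linspace(-5.0, 5.0, 201)[:, None]
w = witness(grid, data, model_samples)
# Positive values mark regions where the data has more mass than the model,
# negative values mark regions where the model over-represents the data.
```

In this toy setting the witness is negative around zero, where the unimodal model places mass that the data lacks, and positive near the two data modes.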

Gintare Karolina Dziugaite, Daniel M. Roy, and Zoubin Ghahramani.
**Training
generative neural networks via Maximum Mean Discrepancy
optimization**.
In *31st Conference on Uncertainty in Artificial Intelligence*, pages
258-267, Amsterdam, The Netherlands, July 2015.

**
Abstract:** We consider training a deep neural network to generate samples
from an unknown distribution given i.i.d. data. We frame learning as an
optimization minimizing a two-sample test statistic: informally speaking, a
good generator network produces samples that cause a two-sample test to fail
to reject the null hypothesis. As our two-sample test statistic, we use an
unbiased estimate of the maximum mean discrepancy, which is the centerpiece
of the nonparametric kernel two-sample test proposed by Gretton et al.
(2012). We compare to the adversarial nets framework introduced by Goodfellow
et al. (2014), in which learning is a two-player game between a generator
network and an adversarial discriminator network, both trained to outwit the
other. From this perspective, the MMD statistic plays the role of the
discriminator. In addition to empirical comparisons, we prove bounds on the
generalization error incurred by optimizing the empirical MMD.
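
The unbiased estimate of the squared maximum mean discrepancy referred to above can be sketched as follows, in the standard form given by Gretton et al. (2012). The Gaussian kernel, its lengthscale, the sample sizes and the function names are illustrative assumptions; the generator network and its training loop are omitted.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    """Gaussian (RBF) kernel matrix between the rows of a and the rows of b."""
    sq_dists = (np.sum(a**2, axis=1)[:, None]
                + np.sum(b**2, axis=1)[None, :]
                - 2.0 * a @ b.T)
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * lengthscale**2))

def mmd2_unbiased(x, y, lengthscale=1.0):
    """Unbiased estimate of MMD^2 between samples x (m, d) and y (n, d)."""
    m, n = len(x), len(y)
    kxx = rbf_kernel(x, x, lengthscale)
    kyy = rbf_kernel(y, y, lengthscale)
    kxy = rbf_kernel(x, y, lengthscale)
    # Within-sample terms exclude the diagonal (i == j) to remove the bias.
    term_xx = (kxx.sum() - np.trace(kxx)) / (m * (m - 1))
    term_yy = (kyy.sum() - np.trace(kyy)) / (n * (n - 1))
    return term_xx + term_yy - 2.0 * kxy.mean()

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(500, 1))
y_same = rng.normal(0.0, 1.0, size=(500, 1))     # same distribution as x
y_shifted = rng.normal(1.0, 1.0, size=(500, 1))  # mean shifted by one

same = mmd2_unbiased(x, y_same)        # close to zero, may be slightly negative
shifted = mmd2_unbiased(x, y_shifted)  # clearly positive
```

Because the estimator is unbiased, it can take small negative values when the two samples come from the same distribution.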

James Hensman, Alexander G D G Matthews, and Zoubin Ghahramani.
**Scalable
Variational Gaussian Process Classification**.
In *18th International Conference on Artificial Intelligence and
Statistics*, pages 1-9, San Diego, California, USA, May 2015.

** Abstract:** Gaussian process classification is a popular
method with a number of appealing properties. We show how to scale the model
within a variational inducing point framework, out-performing the state of
the art on benchmark datasets. Importantly, the variational formulation can be
exploited to allow classification in problems with millions of data points,
as we demonstrate in experiments.

Nilesh Tripuraneni, Shixiang Gu, Hong Ge, and Zoubin Ghahramani.
**Particle
Gibbs for infinite hidden Markov models**.
In *Advances in Neural Information Processing Systems 28*, Montreal,
Canada, December 2015.

** Abstract:** Infinite Hidden Markov Models
(iHMMs) are an attractive, nonparametric generalization of the classical
Hidden Markov Model which can automatically infer the number of hidden states
in the system. However, due to the infinite-dimensional nature of the
transition dynamics, performing inference in the iHMM is difficult. In this
paper, we present an infinite-state Particle Gibbs (PG) algorithm to
resample state trajectories for the iHMM. The proposed algorithm uses an
efficient proposal optimized for iHMMs and leverages ancestor sampling to
improve the mixing of the standard PG algorithm. Our algorithm demonstrates
significant convergence improvements on synthetic and real world data
sets.

Christian Steinruecken, Zoubin Ghahramani, and David MacKay.
**Improving PPM
with dynamic parameter updates**.
In Ali Bilgin, Michael W. Marcellin, Joan Serra-Sagristà, and James A.
Storer, editors, *Proceedings of the Data Compression Conference*.
IEEE Computer Society, April 2015.

** Abstract:** This article
makes several improvements to the classic PPM algorithm, resulting in a new
algorithm with superior compression effectiveness on human text. The key
differences of our algorithm to classic PPM are that (A) rather than the
original escape mechanism, we use a generalised blending method with explicit
hyper-parameters that control the way symbol counts are combined to form
predictions; (B) different hyper-parameters are used for classes of different
contexts; and (C) these hyper-parameters are updated dynamically using
gradient information. The resulting algorithm (PPM-DP) compresses human text
better than all currently published variants of PPM, CTW, DMC, LZ, CSE and
BWT, with runtime only slightly slower than classic PPM.

Yarin Gal, Yutian Chen, and Zoubin Ghahramani.
**Latent
Gaussian processes for distribution estimation of multivariate categorical
data**.
In *Proceedings of the 32nd International Conference on Machine Learning
(ICML-15)*, pages 645-654, 2015.

** Abstract:**
Multivariate categorical data occur in many applications of machine learning.
One of the main difficulties with these vectors of categorical variables is
sparsity. The number of possible observations grows exponentially with vector
length, but dataset diversity might be poor in comparison. Recent models have
gained significant improvement in supervised tasks with this data. These
models embed observations in a continuous space to capture similarities
between them. Building on these ideas we propose a Bayesian model for the
unsupervised task of distribution estimation of multivariate categorical
data. We model vectors of categorical variables as generated from a
non-linear transformation of a continuous latent space. Non-linearity
captures multi-modality in the distribution. The continuous representation
addresses sparsity. Our model ties together many existing models, linking the
linear categorical latent Gaussian model, the Gaussian process latent
variable model, and Gaussian process classification. We derive inference for
our model based on recent developments in sampling based variational
inference. We show empirically that the model outperforms its linear and
discrete counterparts in imputation tasks of sparse data.

Hong Ge, Yutian Chen, Moquan Wan, and Zoubin Ghahramani.
**Distributed
Inference for Dirichlet Process Mixture Models**.
In *Proceedings of the 32nd International Conference on Machine Learning*,
37:2276-2284, 07-09 Jul 2015.

** Abstract:** Bayesian
nonparametric mixture models based on the Dirichlet process (DP) have been
widely used for solving problems like clustering, density estimation and
topic modelling. These models make weak assumptions about the underlying
process that generated the observed data. Thus, when more data are collected,
the complexity of these models can change accordingly. These theoretical
properties often lead to superior predictive performance when compared to
traditional finite mixture models. However, despite the increasing amount of
data available, the application of Bayesian nonparametric mixture models is
so far limited to relatively small data sets. In this paper, we propose an
efficient distributed inference algorithm for the DP and the HDP mixture
model. The proposed method is based on a variant of the slice sampler for
DPs. Since this sampler does not involve a pre-determined truncation, the
stationary distribution of the sampling algorithm is unbiased. We provide
both local thread-level and distributed machine-level parallel
implementations and study the performance of this sampler through an
extensive set of experiments on image and text data. When compared to
existing inference algorithms, the proposed method exhibits state-of-the-art
accuracy and strong scalability with up to 512 cores.

Zoubin Ghahramani.
**Probabilistic
machine learning and artificial intelligence**.
*Nature*, 521:452–459, 2015,
doi:10.1038/nature14541.

** Abstract:** How can a machine
learn from experience? Probabilistic modelling provides a framework for
understanding what learning is, and has therefore emerged as one of the
principal theoretical and practical approaches for designing machines that
learn from data acquired through experience. The probabilistic framework,
which describes how to represent and manipulate uncertainty about models and
predictions, has a central role in scientific data analysis, machine
learning, robotics, cognitive science and artificial intelligence. This
Review provides an introduction to this framework, and discusses some of the
state-of-the-art advances in the field, namely, probabilistic programming,
Bayesian optimization, data compression and automatic model discovery.

José Miguel Hernández-Lobato, Michael A. Gelbart, Matthew W. Hoffman,
Ryan P. Adams, and Zoubin Ghahramani.
**Predictive
entropy search for Bayesian optimization with unknown constraints**.
In *32nd International Conference on Machine Learning*, pages
1699-1707, 2015.

** Abstract:** Unknown constraints arise in
many types of expensive black-box optimization problems. Several methods have
been proposed recently for performing Bayesian optimization with constraints,
based on the expected improvement (EI) heuristic. However, EI can lead to
pathologies when used with constraints. For example, in the case of decoupled
constraints—i.e., when one can independently evaluate the objective or the
constraints—EI can encounter a pathology that prevents exploration.
Additionally, computing EI requires a current best solution, which may not
exist if none of the data collected so far satisfy the constraints. By
contrast, information-based approaches do not suffer from these failure
modes. In this paper, we present a new information-based method called
Predictive Entropy Search with Constraints (PESC). We analyze the performance
of PESC and show that it compares favorably to EI-based approaches on
synthetic and benchmark problems, as well as several real-world examples. We
demonstrate that PESC is an effective algorithm that provides a promising
direction towards a unified solution for constrained Bayesian
optimization.

Tomoharu Iwata, James Robert Lloyd, and Zoubin Ghahramani.
**Unsupervised
many-to-many object matching for relational data**.
*IEEE Transactions on Pattern Analysis and Machine Intelligence*,
2015.

** Abstract:** We propose a method for unsupervised
many-to-many object matching from multiple networks, which is the task of
finding correspondences between groups of nodes in different networks. For
example, the proposed method can discover shared word groups from
multi-lingual document-word networks without cross-language alignment
information. We assume that multiple networks share groups, and each group
has its own interaction pattern with other groups. Using infinite relational
models with this assumption, objects in different networks are clustered into
common groups depending on their interaction patterns, discovering a
matching. The effectiveness of the proposed method is experimentally
demonstrated by using synthetic and real relational data sets, which include
applications to cross-domain recommendation without shared user/item
identifiers and multi-lingual word clustering.

Adam Ścibior, Zoubin Ghahramani, and Andrew D. Gordon.
**Practical
probabilistic programming with monads**.
In *Proceedings of the 8th ACM SIGPLAN Symposium on Haskell*.
Association for Computing Machinery, 2015,
doi:10.1145/2804302.2804317.

** Abstract:** The machine
learning community has recently shown a lot of interest in practical
probabilistic programming systems that target the problem of Bayesian
inference. Such systems come in different forms, but they all express
probabilistic models as computational processes using syntax resembling
programming languages. In the functional programming community monads are
known to offer a convenient and elegant abstraction for programming with
probability distributions, but their use is often limited to very simple
inference problems. We show that it is possible to use the monad abstraction
to construct probabilistic models for machine learning, while still offering
good performance of inference in challenging models. We use a GADT as an
underlying representation of a probability distribution and apply Sequential
Monte Carlo-based methods to achieve efficient inference. We define a formal
semantics via measure theory. We demonstrate a clean and elegant
implementation that achieves performance comparable with Anglican, a
state-of-the-art probabilistic programming system.

José Miguel Hernández-Lobato, Matthew W. Hoffman, and Zoubin Ghahramani.
**Predictive
entropy search for efficient global optimization of black-box
functions**.
In *Advances in Neural Information Processing Systems 27*, pages 1-9,
Montreal, Canada, December 2014.

** Abstract:** We propose a
novel information-theoretic approach for Bayesian optimization called
Predictive Entropy Search (PES). At each iteration, PES selects the next
evaluation point that maximizes the expected information gained with respect
to the global maximum. PES codifies this intractable acquisition function in
terms of the expected reduction in the differential entropy of the predictive
distribution. This reformulation allows PES to obtain approximations that are
both more accurate and efficient than other alternatives such as Entropy
Search (ES). Furthermore, PES can easily perform a fully Bayesian treatment
of the model hyperparameters while ES cannot. We evaluate PES in both
synthetic and real-world applications, including optimization problems in
machine learning, finance, biotechnology, and robotics. We show that the
increased accuracy of PES leads to significant gains in optimization
performance.

Alexander G D G Matthews, James Hensman, and Zoubin Ghahramani.
**Comparing
lower bounds on the entropy of mixture distributions for use in variational
inference**.
In *NIPS workshop on Advances in Variational Inference*,
Montreal, Canada, December 2014.

Yue Wu, José Miguel Hernández-Lobato, and Zoubin Ghahramani.
**Gaussian
process volatility model**.
In *Advances in Neural Information Processing Systems 27*, pages 1-9,
Montreal, Canada, December 2014.

** Abstract:** The prediction
of time-changing variances is an important task in the modeling of financial
data. Standard econometric models are often limited as they assume rigid
functional relationships for the evolution of the variance. Moreover,
functional parameters are usually learned by maximum likelihood, which can
lead to overfitting. To address these problems we introduce GP-Vol, a novel
non-parametric model for time-changing variances based on Gaussian Processes.
This new model can capture highly flexible functional relationships for the
variances. Furthermore, we introduce a new online algorithm for fast
inference in GP-Vol. This method is much faster than current offline
inference procedures and it avoids overfitting problems by following a fully
Bayesian approach. Experiments with financial data show that GP-Vol performs
significantly better than current standard alternatives.

Creighton Heaukulani, David A. Knowles, and Zoubin Ghahramani.
**Beta diffusion trees and
hierarchical feature allocations**.
Technical report, Dept. of Engineering, University of Cambridge, August 2014.

** Abstract:** We define the beta diffusion tree, a random tree
structure with a set of leaves that defines a collection of overlapping
subsets of objects, known as a feature allocation. A generative process for
the tree structure is defined in terms of particles (representing the
objects) diffusing in some continuous space, analogously to the Dirichlet
diffusion tree (Neal, 2003b), which defines a tree structure over partitions
(i.e., non-overlapping subsets) of the objects. Unlike in the Dirichlet
diffusion tree, multiple copies of a particle may exist and diffuse along
multiple branches in the beta diffusion tree, and an object may therefore
belong to multiple subsets of particles. We demonstrate how to build a
hierarchically-clustered factor analysis model with the beta diffusion tree
and how to perform inference over the random tree structures with a Markov
chain Monte Carlo algorithm. We conclude with several numerical experiments
on missing data problems with data sets of gene expression microarrays,
international development statistics, and intranational socioeconomic
measurements.

James Robert Lloyd, David Duvenaud, Roger Grosse, Joshua B. Tenenbaum, and
Zoubin Ghahramani.
**Automatic construction and
natural-language description of nonparametric regression models**.
In *Association for the Advancement of Artificial Intelligence (AAAI)*,
July 2014.

** Abstract:** This paper presents the beginnings
of an automatic statistician, focusing on regression problems. Our system
explores an open-ended space of statistical models to discover a good
explanation of a data set, and then produces a detailed report with figures
and natural-language text. Our approach treats unknown regression functions
nonparametrically using Gaussian processes, which has two important
consequences. First, Gaussian processes can model functions in terms of
high-level properties (e.g. smoothness, trends, periodicity, changepoints).
Taken together with the compositional structure of our language of models
this allows us to automatically describe functions in simple terms. Second,
the use of flexible nonparametric models and a rich language for composing
them in an open-ended manner also results in state-of-the-art extrapolation
performance evaluated over 13 real time series data sets from various
domains.

Neil Houlsby, José Miguel Hernández-Lobato, and Zoubin Ghahramani.
**Cold-start
active learning with robust ordinal matrix factorization**.
In *31st International Conference on Machine Learning*, Beijing, China,
June 2014.

** Abstract:** We present a new matrix
factorization model for rating data and a corresponding active learning
strategy to address the cold-start problem. Cold-start is one of the most
challenging tasks for recommender systems: what to recommend with new users
or items for which one has little or no data. An approach is to use active
learning to collect the most useful initial ratings. However, the performance
of active learning depends strongly upon having accurate estimates of i) the
uncertainty in model parameters and ii) the intrinsic noisiness of the data.
To achieve these estimates we propose a heteroskedastic Bayesian model for
ordinal matrix factorization. We also present a computationally efficient
framework for Bayesian active learning with this type of complex
probabilistic model. This algorithm successfully distinguishes between
informative and noisy data points. Our model yields state-of-the-art
predictive performance and, coupled with our active learning strategy,
enables us to gain useful information in the cold-start setting from the very
first active sample.

Creighton Heaukulani, David A. Knowles, and Zoubin Ghahramani.
**Beta
diffusion trees**.
In *31st International Conference on Machine Learning*, Beijing, China,
June 2014.

** Abstract:** We define the beta diffusion tree, a
random tree structure with a set of leaves that defines a collection of
overlapping subsets of objects, known as a feature allocation. The generative
process for the tree is defined in terms of particles (representing the
objects) diffusing in some continuous space, analogously to the Dirichlet and
Pitman–Yor diffusion trees (Neal, 2003b; Knowles & Ghahramani, 2011), both
of which define tree structures over clusters of the particles. With the beta
diffusion tree, however, multiple copies of a particle may exist and diffuse
to multiple locations in the continuous space, resulting in (a random number
of) possibly overlapping clusters of the objects. We demonstrate how to build
a hierarchically-clustered factor analysis model with the beta diffusion tree
and how to perform inference over the random tree structures with a Markov
chain Monte Carlo algorithm. We conclude with several numerical experiments
on missing data problems with data sets of gene expression arrays,
international development statistics, and intranational socioeconomic
measurements.

José Miguel Hernández-Lobato, Neil Houlsby, and Zoubin Ghahramani.
**Probabilistic
matrix factorization with non-random missing data**.
In *31st International Conference on Machine Learning*, Beijing, China,
June 2014.

** Abstract:** We propose a probabilistic matrix
factorization model for collaborative filtering that learns from data that is
missing not at random (MNAR). Matrix factorization models exhibit
state-of-the-art predictive performance in collaborative filtering. However,
these models usually assume that the data is missing at random (MAR), and
this is rarely the case. For example, the data is not MAR if users rate items
they like more than ones they dislike. When the MAR assumption is incorrect,
inferences are biased and predictive performance can suffer. Therefore, we
model both the generative process for the data and the missing data
mechanism. By learning these two models jointly we obtain improved
performance over state-of-the-art methods when predicting the ratings and
when modeling the data observation process. We present the first viable MF
model for MNAR data. Our results are promising and we expect that further
research on MNAR models will yield large gains in collaborative
filtering.

José Miguel Hernández-Lobato, Neil Houlsby, and Zoubin Ghahramani.
**Stochastic
inference for scalable probabilistic modeling of binary matrices**.
In *31st International Conference on Machine Learning*, Beijing, China,
June 2014.

** Abstract:** Fully observed large binary matrices
appear in a wide variety of contexts. To model them, probabilistic matrix
factorization (PMF) methods are an attractive solution. However, current
batch algorithms for PMF can be inefficient because they need to analyze the
entire data matrix before producing any parameter updates. We derive an
efficient stochastic inference algorithm for PMF models of fully observed
binary matrices. Our method exhibits faster convergence rates than more
expensive batch approaches and has better predictive performance than
scalable alternatives. The proposed method includes new data subsampling
strategies which produce large gains over standard uniform subsampling. We
also address the task of automatically selecting the size of the minibatches
of data used by our method. For this, we derive an algorithm that adjusts
this hyper-parameter online.

Konstantina Palla, David A. Knowles, and Zoubin Ghahramani.
**A reversible
infinite HMM using normalised random measures**.
In *31st International Conference on Machine Learning*, Beijing, China,
June 2014.

** Abstract:** We present a nonparametric prior
over reversible Markov chains. We use completely random measures,
specifically gamma processes, to construct a countably infinite graph with
weighted edges. By enforcing symmetry to make the edges undirected we define
a prior over random walks on graphs that results in a reversible Markov
chain. The resulting prior over infinite transition matrices is closely
related to the hierarchical Dirichlet process but enforces reversibility. A
reinforcement scheme has recently been proposed with similar properties, but
the de Finetti measure is not well characterised. We take the alternative
approach of explicitly constructing the mixing measure, which allows more
straightforward and efficient inference at the cost of no longer having a
closed form predictive distribution. We use our process to construct a
reversible infinite HMM which we apply to two real datasets, one from
epigenomics and one ion channel recording.

David Duvenaud, Oren Rippel, Ryan P. Adams, and Zoubin Ghahramani.
**Avoiding pathologies in very
deep networks**.
In *17th International Conference on Artificial Intelligence and
Statistics*, Reykjavik, Iceland, April 2014.

**
Abstract:** Choosing appropriate architectures and regularization
strategies for deep networks is crucial to good predictive performance. To
shed light on this problem, we analyze the analogous problem of constructing
useful priors on compositions of functions. Specifically, we study the deep
Gaussian process, a type of infinitely-wide, deep neural network. We show
that in standard architectures, the representational capacity of the network
tends to capture fewer degrees of freedom as the number of layers increases,
retaining only a single degree of freedom in the limit. We propose an
alternate network architecture which does not suffer from this pathology. We
also examine deep covariance functions, obtained by composing infinitely many
feature transforms. Lastly, we characterize the class of models obtained by
performing dropout on Gaussian processes.

Sébastien Bratières, Novi Quadrianto, Sebastian Nowozin, and Zoubin
Ghahramani.
**Scalable
Gaussian Process structured prediction for grid factor graph
applications**.
In *31st International Conference on Machine Learning*, 2014.

** Abstract:** Structured prediction is an important and well
studied problem with many applications across machine learning. GPstruct is a
recently proposed structured prediction model that offers appealing
properties such as being kernelised, non-parametric, and supporting Bayesian
inference (Bratières et al. 2013). The model places a Gaussian process prior
over energy functions which describe relationships between input variables
and structured output variables. However, the memory demand of GPstruct is
quadratic in the number of latent variables and training runtime scales
cubically. This prevents GPstruct from being applied to problems involving
grid factor graphs, which are prevalent in computer vision and spatial
statistics applications. Here we explore a scalable approach to learning
GPstruct models based on ensemble learning, with weak learners (predictors)
trained on subsets of the latent variables and bootstrap data, which can
easily be distributed. We show experiments with 4M latent variables on image
segmentation. Our method outperforms widely-used conditional random field
models trained with pseudo-likelihood. Moreover, in image segmentation
problems it improves over recent state-of-the-art marginal optimisation
methods in terms of predictive performance and uncertainty calibration.
Finally, it generalises well on all training set sizes.

Alex Davies and Zoubin Ghahramani.
**The random forest
kernel and other kernels for big data from random partitions**.
*arXiv*, abs/1402.4293, 2014.

** Abstract:** We present
Random Partition Kernels, a new class of kernels derived by demonstrating a
natural connection between random partitions of objects and kernels between
those objects. We show how the construction can be used to create kernels
from methods that would not normally be viewed as random partitions, such as
Random Forest. To demonstrate the potential of this method, we propose two
new kernels, the Random Forest Kernel and the Fast Cluster Kernel, and show
that these kernels consistently outperform standard kernels on problems
involving real-world datasets. Finally, we show how the form of these kernels
lends itself to a natural approximation that is appropriate for certain
big data problems, allowing O(N) inference in methods such as Gaussian
Processes, Support Vector Machines and Kernel PCA.
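
As a rough illustration of the construction described above (not the authors' implementation), a random partition kernel can be estimated as the fraction of sampled partitions in which two points land in the same cell. The sketch below draws each partition by picking a few data points as centres and assigning every point to its nearest centre, a simplified stand-in for a clustering-based partition; the function names and parameters are hypothetical.

```python
import random

def sample_partition(points, n_centres, rng):
    """Assign each point to its nearest randomly chosen centre (one random partition)."""
    centres = rng.sample(points, n_centres)

    def nearest(p):
        return min(range(n_centres),
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centres[c])))

    return [nearest(p) for p in points]

def random_partition_kernel(points, n_partitions=200, n_centres=3, seed=0):
    """K[i][j] = fraction of sampled partitions placing points i and j in the same cell."""
    rng = random.Random(seed)
    n = len(points)
    K = [[0.0] * n for _ in range(n)]
    for _ in range(n_partitions):
        labels = sample_partition(points, n_centres, rng)
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    K[i][j] += 1.0 / n_partitions
    return K

# Two well-separated groups of three points each (toy data).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
K = random_partition_kernel(points)
```

Points in the same group should receive a noticeably larger kernel value than points in different groups, since they share a cell in far more of the sampled partitions.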

Yarin Gal and Zoubin Ghahramani.
**Pitfalls in the
use of parallel inference for the Dirichlet process**.
In *Proceedings of the 31st International Conference on Machine Learning
(ICML-14)*, 2014.

** Abstract:** Recent work done by
Lovell, Adams, and Mansinghka (2012) and Williamson, Dubey, and Xing (2013)
has suggested an alternative parametrisation for the Dirichlet process in
order to derive non-approximate parallel MCMC inference for it – work which
has been picked up and implemented in several different fields. In this paper
we show that the approach suggested is impractical due to an extremely
unbalanced distribution of the data. We characterise the requirements of
efficient parallel inference for the Dirichlet process and show that the
proposed inference fails most of these requirements (while approximate
approaches often satisfy most of them). We present both theoretical and
experimental evidence, analysing the load balance for the inference and
showing that it is independent of the size of the dataset and the number of
nodes available in the parallel implementation. We end with suggestions of
alternative paths of research for efficient non-approximate parallel
inference for the Dirichlet process.
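
The imbalance can be illustrated with a small simulation. It assumes the reparametrisation in question splits a Dirichlet process with concentration alpha across P nodes, with node weights drawn from a symmetric Dirichlet(alpha/P, ..., alpha/P); this reading of the cited construction, and all names below, are assumptions rather than details given in the abstract. Under that assumption the weights are drawn without reference to the data, so the expected share of the busiest node does not shrink as the dataset grows.

```python
import random

def sample_node_weights(alpha, n_nodes, rng):
    """Draw node weights from a symmetric Dirichlet(alpha / n_nodes) via normalised gammas."""
    draws = [rng.gammavariate(alpha / n_nodes, 1.0) for _ in range(n_nodes)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_share(alpha, n_nodes, n_trials=2000, seed=0):
    """Average fraction of the data expected to land on the busiest node."""
    rng = random.Random(seed)
    return sum(max(sample_node_weights(alpha, n_nodes, rng))
               for _ in range(n_trials)) / n_trials

# Hypothetical setting: concentration 1.0 split over 8 nodes.
share = mean_largest_share(alpha=1.0, n_nodes=8)
print(f"busiest node share: {share:.2f}, balanced share: {1 / 8:.3f}")
```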

David Lopez-Paz, Suvrit Sra, Alex J. Smola, Zoubin Ghahramani, and Bernhard
Schölkopf.
**Randomized
nonlinear component analysis**.
In *ICML*, volume 29 of *JMLR Proceedings*. JMLR.org,
2014.

** Abstract:** Classical techniques such as Principal
Component Analysis (PCA) and Canonical Correlation Analysis (CCA) are
ubiquitous in statistics. However, these techniques only reveal linear
relationships in data. Although nonlinear variants of PCA and CCA have been
proposed, they are computationally prohibitive in the large scale. In a
separate strand of recent research, randomized methods have been proposed to
construct features that help reveal nonlinear patterns in data. For basic
tasks such as regression or classification, random features exhibit little or
no loss in performance, while achieving dramatic savings in computational
requirements. In this paper we leverage randomness to design scalable new
variants of nonlinear PCA and CCA; our ideas also extend to key multivariate
analysis tools such as spectral clustering or LDA. We demonstrate our
algorithms through experiments on real-world data, on which we compare
against the state-of-the-art. Code in R implementing our methods is provided
in the Appendix.
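
The randomized features referred to above can be sketched as follows. The example uses random Fourier features for a Gaussian kernel (the construction of Rahimi and Recht), which is an assumption about the feature family; it is written in Python rather than the paper's R, and it stops at the feature map. A randomized nonlinear PCA or CCA along these lines would then run the ordinary linear method on the mapped data.

```python
import math
import random

def make_feature_map(dim, n_features, gamma, seed=0):
    """Random Fourier features z with z(x).z(y) ~= exp(-gamma * ||x - y||^2)."""
    rng = random.Random(seed)
    # Frequencies are drawn from N(0, 2 * gamma * I), phases uniformly from [0, 2*pi).
    std = math.sqrt(2.0 * gamma)
    freqs = [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(n_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def feature_map(x):
        return [scale * math.cos(sum(w_k * x_k for w_k, x_k in zip(w, x)) + b)
                for w, b in zip(freqs, phases)]

    return feature_map

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

gamma = 0.5
z = make_feature_map(dim=2, n_features=4000, gamma=gamma)
x, y = (0.3, -0.2), (1.0, 0.4)
approx = dot(z(x), z(y))  # inner product of random features
exact = math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))  # exact kernel value
```

The approximation error shrinks roughly as one over the square root of the number of features, which is the trade-off between accuracy and cost that such methods exploit.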

Alexander G. D. G. Matthews and Zoubin Ghahramani.
**Classification using log
Gaussian Cox processes**.
*arXiv preprint arXiv:1405.4141*, 2014.

** Abstract:**
McCullagh and Yang (2006) suggest a family of classification algorithms
based on Cox processes. We further investigate the log Gaussian variant which
has a number of appealing properties. Conditioned on the covariates, the
distribution over labels is given by a type of conditional Markov random
field. In the supervised case, computation of the predictive probability of a
single test point scales linearly with the number of training points and the
multiclass generalization is straightforward. We show new links between the
supervised method and classical nonparametric methods. We give a detailed
analysis of the pairwise graph representable Markov random field, which we
use to extend the model to semi-supervised learning problems, and propose an
inference method based on graph min-cuts. We give the first experimental
analysis on supervised and semi-supervised datasets and show good empirical
performance.

Amar Shah, Andrew Gordon Wilson, and Zoubin Ghahramani.
**Student-t
processes as alternatives to Gaussian processes**.
In *AISTATS*, JMLR Proceedings. JMLR.org, 2014.

** Abstract:** We investigate the Student-t process as an alternative to the
Gaussian process as a nonparametric prior over functions. We derive closed
form expressions for the marginal likelihood and predictive distribution of a
Student-t process, by integrating away an inverse Wishart process prior over
the covariance kernel of a Gaussian process model. We show surprising
equivalences between different hierarchical Gaussian process models leading
to Student-t processes, and derive a new sampling scheme for the inverse
Wishart process, which helps elucidate these equivalences. Overall, we show
that a Student-t process can retain the attractive properties of a Gaussian
process - a nonparametric representation, analytic marginal and predictive
distributions, and easy model selection through covariance kernels - but has
enhanced flexibility, and predictive covariances that, unlike a Gaussian
process, explicitly depend on the values of training observations. We verify
empirically that a Student-t process is especially useful in situations where
there are changes in covariance structure, or in applications like Bayesian
optimization, where accurate predictive covariances are critical for good
performance. These advantages come at no additional computational cost over
Gaussian processes.

Tomoharu Iwata, David Duvenaud, and Zoubin Ghahramani.
**Warped mixtures for nonparametric
cluster shapes**.
In *29th Conference on Uncertainty in Artificial Intelligence*,
Bellevue, Washington, July 2013.

** Abstract:** A mixture of
Gaussians fit to a single curved or heavy-tailed cluster will report that the
data contains many clusters. To produce more appropriate clusterings, we
introduce a model which warps a latent mixture of Gaussians to produce
nonparametric cluster shapes. The possibly low-dimensional latent mixture
model allows us to summarize the properties of the high-dimensional clusters
(or density manifolds) describing the data. The number of manifolds, as well
as the shape and dimension of each manifold is automatically inferred. We
derive a simple inference scheme for this model which analytically integrates
out both the mixture parameters and the warping function. We show that our
model is effective for density estimation, performs better than infinite
Gaussian mixture models at recovering the true number of clusters, and
produces interpretable summaries of high-dimensional datasets.

Novi Quadrianto, Viktoriia Sharmanska, David A. Knowles, and Zoubin Ghahramani.
**The supervised
IBP: Neighbourhood preserving infinite latent feature models**.
In *29th Conference on Uncertainty in Artificial Intelligence*,
Bellevue, USA, July 2013.

** Abstract:** We propose a
probabilistic model to infer supervised latent variables in the Hamming space
from observed data. Our model allows simultaneous inference of the number of
binary latent variables, and their values. The latent variables preserve
neighbourhood structure of the data in a sense that objects in the same
semantic concept have similar latent values, and objects in different
concepts have dissimilar latent values. We formulate the supervised infinite
latent variable problem based on an intuitive principle of pulling objects
together if they are of the same type, and pushing them apart if they are
not. We then combine this principle with a flexible Indian Buffet Process
prior on the latent variables. We show that the inferred supervised latent
variables can be directly used to perform a nearest neighbour search for the
purpose of retrieval. We introduce a new application of dynamically extending
hash codes, and show how to effectively couple the structure of the hash
codes with continuously growing structure of the neighbourhood preserving
infinite latent feature space.

David Duvenaud, James Robert Lloyd, Roger Grosse, Joshua B. Tenenbaum, and
Zoubin Ghahramani.
**Structure
discovery in nonparametric regression through compositional kernel
search**.
In *30th International Conference on Machine Learning*, Atlanta,
Georgia, USA, June 2013.

** Abstract:** Despite its
importance, choosing the structural form of the kernel in nonparametric
regression remains a black art. We define a space of kernel structures which
are built compositionally by adding and multiplying a small number of base
kernels. We present a method for searching over this space of structures
which mirrors the scientific discovery process. The learned structures can
often decompose functions into interpretable components and enable long-range
extrapolation on time-series datasets. Our structure search method
outperforms many widely used kernels and kernel combination methods on a
variety of prediction tasks.
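
The search space described above can be illustrated by its expansion step. The sketch assumes four base kernels (SE, PER, LIN, RQ) and represents a structure as a nested tuple; both choices, and the function names, are illustrative assumptions. A full search would score each candidate, for instance by fitting a Gaussian process and comparing a model-selection criterion, then greedily expand the best one; scoring is omitted here.

```python
BASE_KERNELS = ("SE", "PER", "LIN", "RQ")  # assumed base set, for illustration only

def expand(kernel):
    """Return the structures reachable from `kernel` by one search operator.

    A kernel is a base-kernel name or a tuple (op, left, right) with op in {"+", "*"}.
    """
    candidates = set()
    # Operators 1 and 2: add or multiply the whole expression by a base kernel.
    for base in BASE_KERNELS:
        candidates.add(("+", kernel, base))
        candidates.add(("*", kernel, base))
    if isinstance(kernel, str):
        # Operator 3: swap a base kernel for a different base kernel.
        candidates.update(b for b in BASE_KERNELS if b != kernel)
    else:
        # Apply the same operators inside each operand of a composite kernel.
        op, left, right = kernel
        candidates.update((op, new_left, right) for new_left in expand(left))
        candidates.update((op, left, new_right) for new_right in expand(right))
    return candidates

def show(kernel):
    """Render a kernel structure as a readable expression."""
    if isinstance(kernel, str):
        return kernel
    op, left, right = kernel
    return f"({show(left)} {op} {show(right)})"

first_level = expand("SE")
for candidate in sorted(show(k) for k in first_level):
    print(candidate)
```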

Creighton Heaukulani and Zoubin Ghahramani.
**Dynamic
probabilistic models for latent feature propagation in social
networks**.
In *30th International Conference on Machine Learning*, Atlanta,
Georgia, USA, June 2013.

** Abstract:** Current Bayesian
models for dynamic social network data have focused on modelling the
influence of evolving unobserved structure on observed social interactions.
However, an understanding of how observed social relationships from the past
affect future unobserved structure in the network has been neglected. In this
paper, we introduce a new probabilistic model for capturing this phenomenon,
which we call latent feature propagation, in social networks. We demonstrate
our model's capability for inferring such latent structure in varying types
of social network datasets, and experimental studies show this structure
achieves higher predictive performance on link prediction and forecasting
tasks.

David Lopez-Paz, José Miguel Hernández-Lobato, and Zoubin Ghahramani.
**Gaussian
process vine copulas for multivariate dependence**.
In *30th International Conference on Machine Learning*, Atlanta,
Georgia, USA, June 2013.

** Abstract:** Copulas allow us to learn
marginal distributions separately from the multivariate dependence structure
(copula) that links them together into a density function. Vine
factorizations ease the learning of high-dimensional copulas by constructing
a hierarchy of conditional bivariate copulas. However, to simplify inference,
it is common to assume that each of these conditional bivariate copulas is
independent from its conditioning variables. In this paper, we relax this
assumption by discovering the latent functions that specify the shape of a
conditional copula given its conditioning variables. We learn these functions
by following a Bayesian approach based on sparse Gaussian processes with
expectation propagation for scalable, approximate inference. Experiments on
real-world datasets show that, when modeling all conditional dependencies, we
obtain better estimates of the underlying copula of the data.

Yue Wu, José Miguel Hernández-Lobato, and Zoubin Ghahramani.
**Dynamic covariance
models for multivariate financial time series**.
In *30th International Conference on Machine Learning*, Atlanta,
Georgia, USA, June 2013.

** Abstract:** The accurate
prediction of time-changing covariances is an important problem in the
modeling of multivariate financial data. However, some of the most popular
models suffer from a) overfitting problems and multiple local optima, b)
failure to capture shifts in market conditions and c) large computational
costs. To address these problems we introduce a novel dynamic model for
time-changing covariances. Over-fitting and local optima are avoided by
following a Bayesian approach instead of computing point estimates. Changes
in market conditions are captured by assuming a diffusion process in
parameter values, and finally computationally efficient and scalable
inference is performed using particle filters. Experiments with financial
data show excellent performance of the proposed method with respect to
current standard models.

Jacob Andreas and Zoubin Ghahramani.
**A generative model
of vector space semantics**.
*ACL 2013*, page 91, 2013.

** Abstract:** We present
a novel compositional, generative model for vector space representations of
meaning. This model reformulates earlier tensor-based approaches to vector
space semantics as a top-down process, and provides efficient algorithms for
transformation from natural language to vectors and from vectors to natural
language. We describe procedures for estimating the parameters of the model
from positive examples of similar phrases, and from distributional
representations, then use these procedures to obtain similarity judgments for
a set of adjective-noun pairs. The model’s estimation of the similarity of
these pairs correlates well with human annotations, demonstrating a
substantial improvement over several existing compositional approaches in
both settings.

Sébastien Bratières, Novi Quadrianto, and Zoubin Ghahramani.
**Bayesian
structured prediction using Gaussian processes**.
*arXiv*, abs/1307.3846, 2013.

** Abstract:** We introduce
a conceptually novel structured prediction model, GPstruct, which is
kernelized, non-parametric and Bayesian, by design. We motivate the model
with respect to existing approaches, among others, conditional random fields
(CRFs), maximum margin Markov networks (M3N), and structured support vector
machines (SVMstruct), which embody only a subset of its properties. We
present an inference procedure based on Markov Chain Monte Carlo. The
framework can be instantiated for a wide range of structured objects such as
linear chains, trees, grids, and other general graphs. As a proof of concept,
the model is benchmarked on several natural language processing tasks and a
video gesture segmentation task involving a linear chain structure. We show
prediction accuracies for GPstruct which are comparable to or exceeding those
of CRFs and SVMstruct.

Konstantinos Bousmalis, Stefanos Zafeiriou, Louis-Philippe Morency, Maja
Pantic, and Zoubin Ghahramani.
**Variational
hidden conditional random fields with coupled Dirichlet process
mixtures**.
In Hendrik Blockeel, Kristian Kersting, Siegfried Nijssen, and Filip
Zelezný, editors, *ECML/PKDD*, volume 8189 of *Lecture Notes in
Computer Science*, pages 531-547. Springer, 2013.

** Abstract:** Hidden Conditional Random Fields (HCRFs) are discriminative
latent variable models which have been shown to successfully learn the hidden
structure of a given classification problem. An infinite HCRF is an HCRF with
a countably infinite number of hidden states, which rids us not only of the
necessity to specify a priori a fixed number of hidden states available but
also of the problem of overfitting. Markov chain Monte Carlo (MCMC) sampling
algorithms are often employed for inference in such models. However,
convergence of such algorithms is rather difficult to verify, and as the
complexity of the task at hand increases, the computational cost of such
algorithms often becomes prohibitive. These limitations can be overcome by
variational techniques. In this paper, we present a generalized framework for
infinite HCRF models, and a novel variational inference approach on a model
based on coupled Dirichlet Process Mixtures, the HCRF-DPM. We show that the
variational HCRF-DPM is able to converge to a correct number of represented
hidden states, and performs as well as the best parametric HCRFs, chosen
via cross-validation, for the difficult tasks of recognizing instances of
agreement, disagreement, and pain in audiovisual sequences.

Frederik Eaton and Zoubin Ghahramani.
**Model reductions
for inference: Generality of pairwise, binary, and planar factor
graphs**.
*Neural Computation*, 25(5):1213-1260, 2013.

** Abstract:** We offer a solution to the problem of efficiently translating
algorithms between different types of discrete statistical model. We
investigate the expressive power of three classes of model (those with binary
variables, with pairwise factors, and with planar topology) as well as their
four intersections. We formalize a notion of "simple reduction" for the
problem of inferring marginal probabilities and consider whether it is
possible to "simply reduce" marginal inference from general discrete factor
graphs to factor graphs in each of these seven subclasses. We characterize
the reducibility of each class, showing in particular that the class of
binary pairwise factor graphs is able to simply reduce only positive models.
We also exhibit a continuous "spectral reduction" based on polynomial
interpolation, which overcomes this limitation. Experiments assess the
performance of standard approximate inference algorithms on the outputs of
our reductions.

Zoubin Ghahramani.
**Bayesian
nonparametrics and the probabilistic approach to modelling**.
*Philosophical Transactions of the Royal Society A*, 2013.

** Abstract:** Modelling is fundamental to many fields of
science and engineering. A model can be thought of as a representation of
possible data one could predict from a system. The probabilistic approach to
modelling uses probability theory to express all aspects of uncertainty in
the model. The probabilistic approach is synonymous with Bayesian modelling,
which simply uses the rules of probability theory in order to make
predictions, compare alternative models, and learn model parameters and
structure from data. This simple and elegant framework is most powerful when
coupled with flexible probabilistic models. Flexibility is achieved through
the use of Bayesian nonparametrics. This article provides an overview of
probabilistic modelling and an accessible survey of some of the main tools in
Bayesian nonparametrics. The survey covers the use of Bayesian nonparametrics
for modelling unknown functions, density estimation, clustering, time series
modelling, and representing sparsity, hierarchies, and covariance structure.
More specifically it gives brief non-technical overviews of Gaussian
processes, Dirichlet processes, infinite hidden Markov models, Indian buffet
processes, Kingman's coalescent, Dirichlet diffusion trees, and Wishart
processes.

José Miguel Hernández-Lobato, Neil Houlsby, and Zoubin Ghahramani.
**Stochastic
inference for scalable probabilistic modeling of binary matrices**.
In *NIPS Workshop on Randomized Methods for Machine Learning*, 2013.

** Abstract:** Fully observed large binary matrices appear in a
wide variety of contexts. To model them, probabilistic matrix factorization
(PMF) methods are an attractive solution. However, current batch algorithms
for PMF can be inefficient since they need to analyze the entire data matrix
before producing any parameter updates. We derive an efficient stochastic
inference algorithm for PMF models of fully observed binary matrices. Our
method exhibits faster convergence rates than more expensive batch approaches
and has better predictive performance than scalable alternatives. The
proposed method includes new data subsampling strategies which produce large
gains over standard uniform subsampling. We also address the task of
automatically selecting the size of the minibatches of data and we propose an
algorithm that adjusts this hyper-parameter in an online manner.

Tomoharu Iwata, Neil Houlsby, and Zoubin Ghahramani.
**Active
learning for interactive visualization**.
In *16th International Conference on Artificial Intelligence and
Statistics*, 2013.

** Abstract:** Many automatic
visualization methods have been proposed. However, a visualization that is
automatically generated might be different to how a user wants to arrange the
objects in visualization space. By allowing users to re-locate objects in the
embedding space of the visualization, they can adjust the visualization to
their preference. We propose an active learning framework for interactive
visualization which selects objects for the user to re-locate so that they
can obtain their desired visualization by re-locating as few objects as possible. The
framework is based on an information theoretic criterion, which favors
objects that reduce the uncertainty of the visualization. We present a
concrete application of the proposed framework to the Laplacian eigenmap
visualization method. We demonstrate experimentally that the proposed
framework yields the desired visualization with fewer user interactions than
existing methods.

Tomoharu Iwata, Amar Shah, and Zoubin Ghahramani.
**Discovering
latent influence in online social activities via shared cascade Poisson
processes**.
In *KDD*, pages 266-274. Association for Computing Machinery, 2013.

** Abstract:** Many people share their activities with others
through online communities. These shared activities have an impact on other
users' activities. For example, users are likely to become interested in
items that are adopted (e.g. liked, bought and shared) by their friends. In
this paper, we propose a probabilistic model for discovering latent influence
from sequences of item adoption events. An inhomogeneous Poisson process is
used for modeling a sequence, in which adoption by a user triggers the
subsequent adoption of the same item by other users. For modeling adoption of
multiple items, we employ multiple inhomogeneous Poisson processes, which
share parameters, such as influence for each user and relations between
users. The proposed model can be used for finding influential users,
discovering relations between users and predicting item popularity in the
future. We present an efficient Bayesian inference procedure of the proposed
model based on the stochastic EM algorithm. The effectiveness of the proposed
model is demonstrated by using real data sets in a social bookmark sharing
service.

Simon Lacoste-Julien, Konstantina Palla, Alex Davies, Gjergji Kasneci, Thore
Graepel, and Zoubin Ghahramani.
**SiGMa: simple
greedy matching for aligning large knowledge bases**.
In *KDD*, pages 572-580. Association for Computing Machinery, 2013.

** Abstract:** The Internet has enabled the creation of a
growing number of large-scale knowledge bases in a variety of domains
containing complementary information. Tools for automatically aligning these
knowledge bases would make it possible to unify many sources of structured
knowledge and answer complex queries. However, the efficient alignment of
large-scale knowledge bases still poses a considerable challenge. Here, we
present Simple Greedy Matching (SiGMa), a simple algorithm for aligning
knowledge bases with millions of entities and facts. SiGMa is an iterative
propagation algorithm which leverages both the structural information from
the relationship graph as well as flexible similarity measures between entity
properties in a greedy local search, thus making it scalable. Despite its
greedy nature, our experiments indicate that SiGMa can efficiently match some
of the world's largest knowledge bases with high precision. We provide
additional experiments on benchmark datasets which demonstrate that SiGMa can
outperform state-of-the-art approaches both in accuracy and efficiency.

Colorado Reed and Zoubin Ghahramani.
**Scaling the
Indian buffet process via submodular maximization**.
In *ICML*, volume 28 of *JMLR Proceedings*, pages
1013-1021. JMLR.org, 2013.

** Abstract:** Inference for
latent feature models is inherently difficult as the inference space grows
exponentially with the size of the input data and number of latent features.
In this work, we use the maximization-expectation framework of Kurihara &
Welling (2008) to perform approximate MAP inference for linear-Gaussian latent
feature models with an Indian Buffet Process (IBP) prior. This formulation
yields a submodular function of the features that corresponds to a lower
bound on the model evidence. By adding a constant to this function, we obtain
a nonnegative submodular function that can be maximized via a greedy
algorithm that obtains at least a one-third approximation to the optimal
solution. Our inference method scales linearly with the size of the input
data, and we show the efficacy of our method on the largest datasets
currently analyzed using an IBP model.
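
For reference, one greedy scheme with a one-third guarantee for unconstrained maximization of a nonnegative submodular function is the deterministic double greedy algorithm of Buchbinder et al. (2012). Whether this matches the routine used in the paper is an assumption, and the toy objective below (a graph cut) merely stands in for the evidence lower bound.

```python
from itertools import combinations

def double_greedy(ground_set, f):
    """Deterministic double greedy for a nonnegative submodular set function f."""
    lower, upper = set(), set(ground_set)
    for item in ground_set:
        gain_add = f(lower | {item}) - f(lower)    # benefit of adding item to the lower set
        gain_drop = f(upper - {item}) - f(upper)   # benefit of removing item from the upper set
        if gain_add >= gain_drop:
            lower.add(item)
        else:
            upper.discard(item)
    return lower  # lower == upper once every item has been processed

# Toy nonnegative submodular objective: size of the cut induced by a vertex subset.
EDGES = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 4), (3, 4)]
NODES = [0, 1, 2, 3, 4]

def cut_value(subset):
    return sum((u in subset) != (v in subset) for u, v in EDGES)

chosen = double_greedy(NODES, cut_value)

# Brute-force optimum, feasible only because the toy problem is tiny.
best = max(cut_value(set(c))
           for r in range(len(NODES) + 1)
           for c in combinations(NODES, r))
```

Each item is examined once, so the number of objective evaluations grows linearly with the size of the ground set.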

Amar Shah and Zoubin Ghahramani.
**Determinantal
clustering processes - A nonparametric Bayesian approach to kernel based
semi-supervised clustering**.
*UAI*, 2013.

** Abstract:** Semi-supervised clustering is
the task of clustering data points into clusters where only a fraction of the
points are labelled. The true number of clusters in the data is often unknown
and most models require this parameter as an input. Dirichlet process mixture
models are appealing as they can infer the number of clusters from the data.
However, these models do not deal with high dimensional data well and can
encounter difficulties in inference. We present a novel nonparametric
Bayesian kernel based method to cluster data points without the need to
prespecify the number of clusters or to model complicated densities from
which data points are assumed to be generated. The key insight is to use
determinants of submatrices of a kernel matrix as a measure of how close
together a set of points are. We explore some theoretical properties of the
model and derive a natural Gibbs based algorithm with MCMC hyperparameter
learning. The model is implemented on a variety of synthetic and real world
data sets.
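
The key insight can be checked numerically on a toy case. With a Gaussian kernel, which is an assumed choice here, the 2x2 kernel submatrix for points x and y has determinant 1 - k(x, y)^2, so it approaches zero as the points move together and one as they move apart. The function names are hypothetical.

```python
import math

def rbf(x, y, lengthscale=1.0):
    """Gaussian (RBF) kernel, an assumed choice for this illustration."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * lengthscale ** 2))

def pair_determinant(x, y):
    """Determinant of the 2x2 kernel submatrix [[k(x,x), k(x,y)], [k(y,x), k(y,y)]]."""
    return rbf(x, x) * rbf(y, y) - rbf(x, y) * rbf(y, x)

close = pair_determinant((0.0, 0.0), (0.1, 0.0))  # nearby points: determinant near 0
far = pair_determinant((0.0, 0.0), (3.0, 0.0))    # distant points: determinant near 1
```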

James Robert Lloyd, Peter Orbanz, Zoubin Ghahramani, and Daniel M. Roy.
**Random
function priors for exchangeable arrays with applications to graphs and
relational data**.
In *Advances in Neural Information Processing Systems 25*, pages 1-9,
Lake Tahoe, California, USA, December 2012.

** Abstract:** A
fundamental problem in the analysis of structured relational data like
graphs, networks, databases, and matrices is to extract a summary of the
common structure underlying relations between individual entities. Relational
data are typically encoded in the form of arrays; invariance to the ordering
of rows and columns corresponds to exchangeable arrays. Results in
probability theory due to Aldous, Hoover and Kallenberg show that
exchangeable arrays can be represented in terms of a random measurable
function which constitutes the natural model parameter in a Bayesian model.
We obtain a flexible yet simple Bayesian nonparametric model by placing a
Gaussian process prior on the parameter function. Efficient inference
utilises elliptical slice sampling combined with a random sparse
approximation to the Gaussian process. We demonstrate applications of the
model to network data and clarify its relation to models in the literature,
several of which emerge as special cases.

Michael A. Osborne, David Duvenaud, Roman Garnett, Carl Edward Rasmussen,
Stephen J. Roberts, and Zoubin Ghahramani.
**Active
learning of model evidence using Bayesian quadrature**.
In *Advances in Neural Information Processing Systems 25*, pages 46-54,
Lake Tahoe, California, USA, December 2012.

** Abstract:**
Numerical integration is a key component of many problems in scientific
computing, statistical modelling, and machine learning. Bayesian Quadrature
is a model-based method for numerical integration which, relative to standard
Monte Carlo methods, offers increased sample efficiency and a more robust
estimate of the uncertainty in the estimated integral. We propose a novel
Bayesian Quadrature approach for numerical integration when the integrand is
non-negative, such as the case of computing the marginal likelihood,
predictive distribution, or normalising constant of a probabilistic model.
Our approach approximately marginalises the quadrature model’s
hyperparameters in closed form, and introduces an active learning scheme to
optimally select function evaluations, as opposed to using Monte Carlo
samples. We demonstrate our method on both a number of synthetic benchmarks
and a real scientific problem from astronomy.

Konstantina Palla, David A. Knowles, and Zoubin Ghahramani.
**A
nonparametric variable clustering model**.
In *Advances in Neural Information Processing Systems 25*, pages 1-9,
Lake Tahoe, California, USA, December 2012.

** Abstract:**
Factor analysis models effectively summarise the covariance structure of high
dimensional data, but the solutions are typically hard to interpret. This
motivates attempting to find a disjoint partition, i.e. a simple clustering,
of observed variables into highly correlated subsets. We introduce a Bayesian
non-parametric approach to this problem, and demonstrate advantages over
heuristic methods proposed to date. Our Dirichlet process variable clustering
(DPVC) model can discover block-diagonal covariance structures in data. We
evaluate our method on both synthetic and gene expression analysis
problems.

Konstantina Palla, David A. Knowles, and Zoubin Ghahramani.
**An infinite latent attribute
model for network data**.
In *29th International Conference on Machine Learning*, Edinburgh,
Scotland, June 2012.

** Abstract:** Latent variable models for
network data extract a summary of the relational structure underlying an
observed network. The simplest possible models subdivide nodes of the network
into clusters; the probability of a link between any two nodes then depends
only on their cluster assignment. Currently available models can be
classified by whether clusters are disjoint or are allowed to overlap. These
models can explain a "flat" clustering structure. Hierarchical Bayesian
models provide a natural approach to capture more complex dependencies. We
propose a model in which objects are characterised by a latent feature
vector. Each feature is itself partitioned into disjoint groups
(subclusters), corresponding to a second layer of hierarchy. In experimental
comparisons, the model achieves significantly improved predictive performance
on social and biological link prediction tasks. The results indicate that
models with a single layer hierarchy over-simplify real networks.

Andrew Gordon Wilson, David A. Knowles, and Zoubin Ghahramani.
**Gaussian process regression
networks**.
In *29th International Conference on Machine Learning*, Edinburgh,
Scotland, June 2012.

** Abstract:** We introduce a new
regression framework, Gaussian process regression networks (GPRN), which
combines the structural properties of Bayesian neural networks with the
nonparametric flexibility of Gaussian processes. GPRN accommodates input
(predictor) dependent signal and noise correlations between multiple output
(response) variables, input dependent length-scales and amplitudes, and
heavy-tailed predictive distributions. We derive both elliptical slice
sampling and variational Bayes inference procedures for GPRN. We apply GPRN
as a multiple output regression and multivariate volatility model,
demonstrating substantially improved performance over eight popular multiple
output (multi-task) Gaussian process models and three multivariate volatility
models on real datasets, including a 1000 dimensional gene expression
dataset.

A. Bahramisharif, M. A. J. van Gerven, J-M. Schoffelen, Z. Ghahramani, and
T. Heskes.
**The dynamic
beamformer**.
In G. Langs et al., editors, *Machine Learning in Interpretation of
Neuroimaging (MLINI) 2011 LNAI 7263*, pages 148-155, 2012.

** Abstract:** Beamforming is one of the most commonly used
methods for estimating the active neural sources from the MEG or EEG sensor
readings. The basic assumption in beamforming is that the sources are
uncorrelated, which allows for estimating each source independently of the
others. In this paper, we incorporate the independence assumption of the
standard beamformer in a linear dynamical system, thereby introducing the
dynamic beamformer. Using empirical data, we show that the dynamic beamformer
outperforms the standard beamformer in predicting the condition of interest,
which strongly suggests that it also outperforms the standard method in
localizing the active neural generators.

John P. Cunningham, Zoubin Ghahramani, and Carl Edward Rasmussen.
**Gaussian
Processes for time-marked time-series data**.
In *15th International Conference on Artificial Intelligence and
Statistics*, pages 255-263, 2012.

** Abstract:** In many
settings, data is collected as multiple time series, where each recorded time
series is an observation of some underlying dynamical process of interest.
These observations are often time-marked with known event times, and one
desires to do a range of standard analyses. When there is only one time
marker, one simply aligns the observations temporally on that marker. When
multiple time-markers are present and are at different times on different
time series observations, these analyses are more difficult. We describe a
Gaussian Process model for analyzing multiple time series with multiple time
markings, and we test it on a variety of data.

Neil Houlsby, Jose Miguel Hernández-Lobato, Ferenc Huszár, and Zoubin
Ghahramani.
**Collaborative
Gaussian processes for preference learning**.
In *Advances in Neural Information Processing Systems 25*, pages
2096-2104. Curran Associates, Inc., 2012.

** Abstract:** We
present a new model based on Gaussian processes (GPs) for learning pairwise
preferences expressed by multiple users. Inference is simplified by using a
*preference kernel* for GPs which allows us to combine supervised GP
learning of user preferences with unsupervised dimensionality reduction for
multi-user systems. The model not only exploits collaborative information
from the shared structure in user behavior, but may also incorporate user
features if they are available. Approximate inference is implemented using a
combination of expectation propagation and variational Bayes. Finally, we
present an efficient active learning strategy for querying preferences. The
proposed technique performs favorably on real-world data against
state-of-the-art multi-user preference learning algorithms.

Hyun-Chul Kim and Zoubin Ghahramani.
**Bayesian
classifier combination**.
In *15th International Conference on Artificial Intelligence and
Statistics*, 2012.

** Abstract:** Bayesian model averaging
linearly mixes the probabilistic predictions of multiple models, each
weighted by its posterior probability. This is the coherent Bayesian way of
combining multiple models only under certain restrictive assumptions, which
we outline. We explore a general framework for Bayesian model combination
(which differs from model averaging) in the context of classification. This
framework explicitly models the relationship between each model’s output
and the unknown true label. The framework does not require that the models be
probabilistic (they can even be human assessors), that they share prior
information or receive the same training data, or that they be independent in
their errors. Finally, the Bayesian combiner does not need to believe any of
the models is in fact correct. We test several variants of this classifier
combination procedure starting from a classic statistical model proposed by
Dawid and Skene (1979) and using MCMC to add more complex but important
features to the model. Comparisons on several data sets to simpler methods
like majority voting show that the Bayesian methods not only perform well but
result in interpretable diagnostics on the data points and the models.

P. Kirk, J. E. Griffin, R. S. Savage, Z. Ghahramani, and D. L. Wild.
**Bayesian
correlated clustering to integrate multiple datasets**.
*Bioinformatics*, 2012.

** Abstract:** Motivation: The
integration of multiple datasets remains a key challenge in systems biology
and genomic medicine. Modern high-throughput technologies generate a broad
array of different data types, providing distinct – but often complementary
– information. We present a Bayesian method for the unsupervised
integrative modelling of multiple datasets, which we refer to as MDI
(Multiple Dataset Integration). MDI can integrate information from a wide
range of different datasets and data types simultaneously (including the
ability to model time series data explicitly using Gaussian processes). Each
dataset is modelled using a Dirichlet-multinomial allocation (DMA) mixture
model, with dependencies between these models captured via parameters that
describe the agreement among the datasets.

Results: Using a set of 6
artificially constructed time series datasets, we show that MDI is able to
integrate a significant number of datasets simultaneously, and that it
successfully captures the underlying structural similarity between the
datasets. We also analyse a variety of real S. cerevisiae datasets. In the
2-dataset case, we show that MDI’s performance is comparable to the present
state of the art. We then move beyond the capabilities of current approaches
and integrate gene expression, ChIP-chip and protein-protein interaction
data, to identify a set of protein complexes for which genes are co-regulated
during the cell cycle. Comparisons to other unsupervised data integration
techniques – as well as to non-integrative approaches – demonstrate that
MDI is very competitive, while also providing information that would be
difficult or impossible to extract using other methods.

** Comment:** This paper is available from the Bioinformatics
site and a Matlab implementation of MDI is available from this site.

Paul D. W. Kirk, Jim E. Griffin, Richard S. Savage, Zoubin Ghahramani, and
David L. Wild.
**Bayesian
correlated clustering to integrate multiple datasets**.
*Bioinformatics*, 28(24):3290-3297, 2012.

** Abstract:**
MOTIVATION: The integration of multiple datasets remains a key challenge in
systems biology and genomic medicine. Modern high-throughput technologies
generate a broad array of different data types, providing distinct, but often
complementary, information. We present a Bayesian method for the unsupervised
integrative modelling of multiple datasets, which we refer to as MDI
(Multiple Dataset Integration). MDI can integrate information from a wide
range of different datasets and data types simultaneously (including the
ability to model time series data explicitly using Gaussian processes). Each
dataset is modelled using a Dirichlet-multinomial allocation (DMA) mixture
model, with dependencies between these models captured through parameters
that describe the agreement among the datasets. RESULTS: Using a set of six
artificially constructed time series datasets, we show that MDI is able to
integrate a significant number of datasets simultaneously, and that it
successfully captures the underlying structural similarity between the
datasets. We also analyse a variety of real Saccharomyces cerevisiae
datasets. In the two-dataset case, we show that MDI's performance is
comparable with the present state-of-the-art. We then move beyond the
capabilities of current approaches and integrate gene expression, chromatin
immunoprecipitation-chip and protein-protein interaction data, to identify a
set of protein complexes for which genes are co-regulated during the cell
cycle. Comparisons to other unsupervised data integration techniques, as well
as to non-integrative approaches, demonstrate that MDI is competitive, while
also providing information that would be difficult or impossible to extract
using other methods.

Shakir Mohamed, Katherine A. Heller, and Zoubin Ghahramani.
**Bayesian and
L1 approaches for sparse unsupervised learning**.
In

** Abstract:** The use of L1 regularisation for sparse learning
has generated immense research interest, with many successful applications in
diverse areas such as signal acquisition, image coding, genomics and
collaborative filtering. While existing work highlights the many advantages
of L1 methods, in this paper we find that L1 regularisation often
dramatically under-performs in terms of predictive performance when compared
to other methods for inferring sparsity. We focus on unsupervised latent
variable models, and develop L1 minimising factor models, Bayesian variants
of "L1", and Bayesian models with a stronger L0-like sparsity induced through
spike-and-slab distributions. These spike-and-slab Bayesian factor models
encourage sparsity while accounting for uncertainty in a principled manner,
and avoid unnecessary shrinkage of non-zero values. We demonstrate on a
number of data sets that in practice spike-and-slab Bayesian methods
out-perform L1 minimisation, even on a computational budget. We thus
highlight the need to re-assess the wide use of L1 methods in
sparsity-reliant applications, particularly when we care about generalising
to previously unseen data, and provide an alternative that, over many varying
conditions, provides improved generalisation performance.

Donglin Niu, Jennifer G. Dy, and Z. Ghahramani.
**A nonparametric
Bayesian model for multiple clustering with overlapping feature
views**.
In *15th International Conference on Artificial Intelligence and
Statistics*, 2012.

** Abstract:** Most clustering
algorithms produce a single clustering solution. This is inadequate for many
data sets that are multi-faceted and can be grouped and interpreted in many
different ways. Moreover, for high-dimensional data, different features may
be relevant or irrelevant to each clustering solution, suggesting the need
for feature selection in clustering. Features relevant to one clustering
interpretation may be different from the ones relevant for an alternative
interpretation or view of the data. In this paper, we introduce a
probabilistic nonparametric Bayesian model that can discover multiple
clustering solutions from data and the feature subsets that are relevant for
the clusters in each view. In our model, the features in different views may
be shared and therefore the sets of relevant features are allowed to overlap.
We model feature relevance to each view using an Indian Buffet Process and
the cluster membership in each view using a Chinese Restaurant Process. We
provide an inference approach to learn the latent parameters corresponding to
this multiple partitioning problem. Our model not only learns the features
and clusters in each view but also automatically learns the number of
clusters, number of views and number of features in each view.

Barnabas Poczos, Zoubin Ghahramani, and Jeff Schneider.
**Copula-based
kernel dependency measures**.
In *29th International Conference on Machine Learning*, 2012.

** Abstract:** The paper presents a new copula based method for
measuring dependence between random variables. Our approach extends the
Maximum Mean Discrepancy to the copula of the joint distribution. We prove
that this approach has several advantageous properties. Similarly to Shannon
mutual information, the proposed dependence measure is invariant to any
strictly increasing transformation of the marginal variables. This is
important in many applications, for example in feature selection. The
estimator is consistent, robust to outliers, and uses rank statistics only.
We derive upper bounds on the convergence rate and propose independence tests
too. We illustrate the theoretical contributions through a series of
experiments in feature selection and low-dimensional embedding of
distributions.

Jacob Steinhardt and Zoubin Ghahramani.
**Flexible martingale
priors for deep hierarchies**.
In *15th International Conference on Artificial Intelligence and
Statistics*, 2012.

** Abstract:** When building priors
over trees for Bayesian hierarchical models, there is a tension between
maintaining desirable theoretical properties such as infinite exchangeability
and important practical properties such as the ability to increase the depth
of the tree to accommodate new data. We resolve this tension by presenting a
family of infinitely exchangeable priors over discrete tree structures that
allows the depth of the tree to grow with the data, and then showing that our
family contains all hierarchical models with certain mild symmetry
properties. We also show that deep hierarchical models are in general
intimately tied to a process called a martingale, and use Doob’s martingale
convergence theorem to demonstrate some unexpected properties of deep
hierarchies.

Kyung-Ah Sohn, Zoubin Ghahramani, and Eric P. Xing.
**Robust estimation
of local genetic ancestry in admixed populations using a non-parametric
Bayesian approach**.
*Genetics*, 191(4), 2012.

** Abstract:** We present a new
haplotype-based approach for inferring local genetic ancestry of individuals
in an admixed population. Most existing approaches for local ancestry
estimation ignore the latent genetic relatedness between ancestral
populations and treat them as independent. In this paper, we exploit such
information by building an inheritance model that describes both the
ancestral populations and the admixed population jointly in a unified
framework. Based on an assumption that the common hypothetical founder
haplotypes give rise to both the ancestral and admixed population haplotypes,
we employ an infinite hidden Markov model to characterize each ancestral
population and further extend it to generate the admixed population. Through
an effective utilization of the population structural information under a
principled nonparametric Bayesian framework, the resulting model is
significantly less sensitive to the choice and the amount of training data
for ancestral populations than state-of-the-art algorithms. We also improve
the robustness under deviation from common modeling assumptions by
incorporating population-specific scale parameters that allow variable
recombination rates in different populations. Our method is applicable to an
admixed population from an arbitrary number of ancestral populations and also
performs competitively in terms of spurious ancestry proportions under
a general multi-way admixture assumption. We validate the proposed method by
simulation under various admixing scenarios and present empirical analysis
results on a worldwide distributed dataset from the Human Genome Diversity
Project.

** Comment:** doi: 10.1534/genetics.112.140228

Andrew Gordon Wilson and Zoubin Ghahramani.
**Modelling input
varying correlations between multiple responses**.
In Peter A. Flach, Tijl De Bie, and Nello Cristianini, editors,
*ECML/PKDD*, volume 7524 of *Lecture Notes in Computer
Science*, pages 858-861. Springer, 2012.

** Abstract:**
We introduced a generalised Wishart process (GWP) for modelling input
dependent covariance matrices Σ(x), allowing one to model input varying
correlations and uncertainties between multiple response variables. The GWP
can naturally scale to thousands of response variables, as opposed to
competing multivariate volatility models which are typically intractable for
greater than 5 response variables. The GWP can also naturally capture a rich
class of covariance dynamics, such as periodicity, Brownian motion, and
smoothness, through a covariance kernel.

Yichuan Zhang, Charles A. Sutton, Amos J. Storkey, and Zoubin Ghahramani.
**Continuous
relaxations for discrete Hamiltonian Monte Carlo**.
In Peter L. Bartlett, Fernando C. N. Pereira, Christopher J. C. Burges,
Léon Bottou, and Kilian Q. Weinberger, editors, *NIPS*, pages
3203-3211, 2012.

** Abstract:** Continuous relaxations play
an important role in discrete optimization, but have not seen much use in
approximate probabilistic inference. Here we show that a general form of the
Gaussian Integral Trick makes it possible to transform a wide class of
discrete variable undirected models into fully continuous systems. The
continuous representation allows the use of gradient-based Hamiltonian Monte
Carlo for inference, results in new ways of estimating normalization
constants (partition functions), and in general opens up a number of new
avenues for inference in difficult discrete systems. We demonstrate some of
these continuous relaxation inference algorithms on a number of illustrative
problems.

Andrew Gordon Wilson, David A Knowles, and Zoubin Ghahramani.
**Gaussian process
regression networks**.
Technical Report arXiv:1110.4411 [stat.ML], Department of Engineering,
University of Cambridge, Cambridge, UK, October 19 2011.

** Abstract:** We introduce a new regression framework, Gaussian process
regression networks (GPRN), which combines the structural properties of
Bayesian neural networks with the non-parametric flexibility of Gaussian
processes. This model accommodates input dependent signal and noise
correlations between multiple response variables, input dependent
length-scales and amplitudes, and heavy-tailed predictive distributions. We
derive both efficient Markov chain Monte Carlo and variational Bayes
inference procedures for this model. We apply GPRN as a multiple output
regression and multivariate volatility model, demonstrating substantially
improved performance over eight popular multiple output (multi-task) Gaussian
process models and three multivariate volatility models on benchmark
datasets, including a 1000 dimensional gene expression dataset.

** Comment:** arXiv:1110.4411

A. Davies and Z. Ghahramani.
**Language-independent
Bayesian sentiment mining of twitter**.
In *The Fifth Workshop on Social Network Mining and Analysis
(SNA-KDD 2011)*, August 2011.

** Abstract:** This paper
outlines a new language-independent model for sentiment analysis of short,
social-network statuses. We demonstrate this on data from Twitter, modelling
happy vs sad sentiment, and show that in some circumstances this outperforms
similar Naive Bayes models by more than 10%. We also propose an extension to
allow the modelling of different sentiment distributions in different
geographic regions, while incorporating information from neighbouring
regions. We outline the considerations when creating a system analysing
Twitter data and present a scalable system of data acquisition and
prediction that can monitor the sentiment of tweets in real time.

Thomas L. Griffiths and Zoubin Ghahramani.
**The Indian buffet
process: An introduction and review**.
*Journal of Machine Learning Research*, 12:1185-1224, April 2011.

** Abstract:** The Indian buffet process is a stochastic process
defining a probability distribution over equivalence classes of sparse binary
matrices with a finite number of rows and an unbounded number of columns.
This distribution is suitable for use as a prior in probabilistic models that
represent objects using a potentially infinite array of features, or that
involve bipartite graphs in which the size of at least one class of nodes is
unknown. We give a detailed derivation of this distribution, and illustrate
its use as a prior in an infinite latent feature model. We then review recent
applications of the Indian buffet process in machine learning, discuss its
extensions, and summarize its connections to other stochastic processes.

Simon Lacoste-Julien, Ferenc Huszár, and Zoubin Ghahramani.
**Approximate
inference for the loss-calibrated Bayesian**.
In Geoff Gordon and David Dunson, editors, *14th International Conference on
Artificial Intelligence and Statistics*, volume 15, pages 416-424,
Fort Lauderdale, FL, USA, April 2011. Journal of Machine Learning Research.

** Abstract:** We consider the problem of approximate inference
in the context of Bayesian decision theory. Traditional approaches focus on
approximating general properties of the posterior, ignoring the decision task
- and associated losses - for which the posterior could be used. We argue
that this can be suboptimal and propose instead to *loss-calibrate* the
approximate inference methods with respect to the decision task at hand. We
present a general framework rooted in Bayesian decision theory to analyze
approximate inference from the perspective of losses, opening up several
research directions. As a first loss-calibrated approximate inference
attempt, we propose an EM-like algorithm on the Bayesian posterior risk and
show how it can improve a standard approach to Gaussian process
classification when losses are asymmetric.

Joshua Abbott, Katherine A. Heller, Zoubin Ghahramani, and Thomas L. Griffiths.
**Testing a
Bayesian measure of representativeness using a large image
database**.
In *Advances in Neural Information Processing Systems 24*, Cambridge,
MA, USA, 2011. The MIT Press.

** Abstract:** How do people
determine which elements of a set are most representative of that set? We
extend an existing Bayesian measure of representativeness, which indicates
the representativeness of a sample from a distribution, to define a measure
of the representativeness of an item to a set. We show that this measure is
formally related to a machine learning method known as Bayesian Sets.
Building on this connection, we derive an analytic expression for the
representativeness of objects described by a sparse vector of binary
features. We then apply this measure to a large database of images, using it
to determine which images are the most representative members of different
sets. Comparing the resulting predictions to human judgments of
representativeness provides a test of this measure with naturalistic stimuli,
and illustrates how databases that are more commonly used in computer vision
and machine learning can be used to evaluate psychological theories.

Finale Doshi-Velez and Zoubin Ghahramani.
**A comparison of
human and agent reinforcement learning in partially observable
domains**.
In *33rd Annual Meeting of the Cognitive Science Society*, Boston, MA,
2011.

** Abstract:** It is commonly stated that reinforcement
learning (RL) algorithms learn slower than humans. In this work, we
investigate this claim using two standard problems from the RL literature. We
compare the performance of human subjects to RL techniques. We find that
context (the meaningfulness of the observations) plays a significant role
in the rate of human RL. Moreover, without contextual information, humans
often fare much worse than classic algorithms. Comparing the detailed
responses of humans and RL algorithms, we also find that humans appear to
employ rather different strategies from standard algorithms, even in cases
where they had indistinguishable performance to them. Our research both sheds
light on human RL and provides insights for improving RL algorithms.

Neil Houlsby, Ferenc Huszár, Zoubin Ghahramani, and Máté Lengyel.
**Bayesian active
learning for classification and preference learning**.
*arXiv*, abs/1112.5745, 2011.

** Abstract:** Information
theoretic active learning has been widely studied for probabilistic models.
For simple regression an optimal myopic policy is easily tractable. However,
for other tasks and with more complex models, such as classification with
nonparametric models, the optimal solution is harder to compute. Current
approaches make approximations to achieve tractability. We propose an
approach that expresses information gain in terms of predictive entropies,
and apply this method to the Gaussian Process Classifier (GPC). Our approach
makes minimal approximations to the full information theoretic objective. Our
experimental performance compares favourably to many popular active learning
algorithms, and has equal or lower computational complexity. We compare well
to decision theoretic approaches also, which are privy to more information
and require much more computational time. Secondly, by developing further a
reformulation of binary preference learning to a classification problem, we
extend our algorithm to Gaussian Process preference learning.

David A. Knowles and Zoubin Ghahramani.
**Nonparametric
Bayesian sparse factor models with application to gene expression
modelling**.
*Annals of Applied Statistics*, 5(2B):1534-1552, 2011.

** Abstract:** A nonparametric Bayesian extension of Factor Analysis (FA) is
proposed where observed data Y is modeled as a linear superposition, G, of a
potentially infinite number of hidden factors, X. The Indian Buffet Process
(IBP) is used as a prior on G to incorporate sparsity and to allow the number
of latent features to be inferred. The model's utility for modeling gene
expression data is investigated using randomly generated data sets based on a
known sparse connectivity matrix for E. coli, and on three biological data
sets of increasing complexity.

David A. Knowles and Zoubin Ghahramani.
**Pitman-Yor
diffusion trees**.
In *27th Conference on Uncertainty in Artificial Intelligence*, 2011.

** Abstract:** We introduce the Pitman-Yor Diffusion Tree (PYDT)
for hierarchical clustering, a generalization of the Dirichlet Diffusion Tree
(Neal, 2001) which removes the restriction to binary branching structure. The
generative process is described and shown to result in an exchangeable
distribution over data points. We prove some theoretical properties of the
model and then present two inference methods: a collapsed MCMC sampler which
allows us to model uncertainty over tree structures, and a computationally
efficient greedy Bayesian EM search algorithm. Both algorithms use message
passing on the tree structure. The utility of the model and algorithms is
demonstrated on synthetic and real world data, both continuous and
binary.

** Comment:** web site

David A. Knowles, Jurgen Van Gael, and Zoubin Ghahramani.
**Message passing
algorithms for the Dirichlet diffusion tree**.
In *28th International Conference on Machine Learning*, 2011.

** Abstract:** We demonstrate efficient approximate inference
for the Dirichlet Diffusion Tree (Neal, 2003), a Bayesian nonparametric prior
over tree structures. Although DDTs provide a powerful and elegant approach
for modeling hierarchies they haven't seen much use to date. One problem is
the computational cost of MCMC inference. We provide the first deterministic
approximate inference methods for DDT models and show excellent performance
compared to the MCMC alternative. We present message passing algorithms to
approximate the Bayesian model evidence for a specific tree. This is used to
drive sequential tree building and greedy search to find optimal tree
structures, corresponding to hierarchical clusterings of the data. We
demonstrate appropriate observation models for continuous and binary data.
The empirical performance of our method is very close to the computationally
expensive MCMC alternative on a density estimation problem, and significantly
outperforms kernel density estimators.

** Comment:** web site

Andrew Gordon Wilson and Zoubin Ghahramani.
**Generalised
Wishart processes**.
In *27th Conference on Uncertainty in Artificial Intelligence*, 2011.

** Abstract:** We introduce a new stochastic process called the
generalised Wishart process (GWP). It is a collection of positive
semi-definite random matrices indexed by any arbitrary input variable. We use
this process as a prior over dynamic (e.g. time varying) covariance matrices.
The GWP captures a diverse class of covariance dynamics, naturally handles
missing data, scales nicely with dimension, has easily interpretable
parameters, and can use input variables that include covariates other than
time. We describe how to construct the GWP, introduce general procedures for
inference and prediction, and show that it outperforms its main competitor,
multivariate GARCH, even on financial data that especially suits GARCH.

** Comment:** Supplementary
Material, Best Student Paper Award

Ryan Turner, Steven Bottone, and Zoubin Ghahramani.
**Fast online
anomaly detection using scan statistics**.
In Samuel Kaski, David J. Miller, Erkki Oja, and Antti Honkela, editors,
*Machine Learning for Signal Processing (MLSP 2010)*, pages 385-390,
Kittilä, Finland, August 2010.

** Abstract:** We present
methods to do fast online anomaly detection using scan statistics. Scan
statistics have long been used to detect statistically significant bursts of
events. We extend the scan statistics framework to handle many practical
issues that occur in application: dealing with an unknown background rate of
events, allowing for slow natural changes in background frequency, the
inverse problem of finding an unusual lack of events, and setting the test
parameters to maximize power. We demonstrate its use on real and synthetic
data sets with comparison to other methods.

Y. Guan, J. G. Dy, D. Niu, and Z. Ghahramani.
**Variational
inference for nonparametric multiple clustering**.
In *KDD10 Workshop on Discovering, Summarizing, and Using Multiple
Clusterings*, Washington, DC, USA, July 2010.

** Abstract:** Most clustering algorithms produce a single clustering
solution. Similarly, feature selection for clustering tries to find one
feature subset where one interesting clustering solution resides. However, a
single data set may be multi-faceted and can be grouped and interpreted in
many different ways, especially for high dimensional data, where feature
selection is typically needed. Moreover, different clustering solutions are
interesting for different purposes. Instead of committing to one clustering
solution, in this paper we introduce a probabilistic nonparametric Bayesian
model that can discover several possible clustering solutions and the feature
subset views that generated each cluster partitioning simultaneously. We
provide a variational inference approach to learn the features and clustering
partitions in each view. Our model not only learns the multiple clusterings
and views but also automatically learns the number of views and the number
of clusters in each view.

C. Rotsos, J. Van Gael, A.W. Moore, and Z. Ghahramani.
**Traffic
classification in information poor environments**.
In *1st International Workshop on Traffic Analysis and Classification (IWCMC
'10)*, Caen, France, July 2010.

** Abstract:** Traffic
classification using machine learning continues to be an active research
area. The majority of work in this area uses *off-the-shelf* machine
learning tools and treats them as *black-box* classifiers. This approach
turns all the modelling complexity into a feature selection problem. In this
paper, we build a problem-specific solution to the traffic classification
problem by designing a custom probabilistic graphical model. Graphical models
are a modular framework to design classifiers which incorporate
domain-specific knowledge. More specifically, our solution introduces
semi-supervised learning which means we learn from both labelled and
unlabelled traffic flows. We show that our solution performs competitively
compared to previous approaches while using less data and simpler
features.

R. P. Adams, H. Wallach, and Zoubin Ghahramani.
**Learning the
structure of deep sparse graphical models**.
In Yee Whye Teh and Mike Titterington, editors, *13th International
Conference on Artificial Intelligence and Statistics*, pages 1-8, Chia
Laguna, Sardinia, Italy, May 2010.

** Abstract:** Deep belief
networks are a powerful way to model complex probability distributions.
However, it is difficult to learn the structure of a belief network,
particularly one with hidden units. The Indian buffet process has been used
as a nonparametric Bayesian prior on the structure of a directed belief
network with a single infinitely wide hidden layer. Here, we introduce the
cascading Indian buffet process (CIBP), which provides a prior on the
structure of a layered, directed belief network that is unbounded in both
depth and width, yet allows tractable inference. We use the CIBP prior with
the nonlinear Gaussian belief network framework to allow each unit to vary
its behavior between discrete and continuous representations. We use Markov
chain Monte Carlo for inference in this model and explore the structures
learned on image data.

** Comment:** Winner of the Best Paper Award

Sinead Williamson, Peter Orbanz, and Zoubin Ghahramani.
**Dependent
Indian buffet processes**.
In *13th International Conference on Artificial Intelligence and
Statistics*, volume 9 of *W & CP*, pages 924-931, Chia
Laguna, Sardinia, Italy, May 2010.

** Abstract:** Latent
variable models represent hidden structure in observational data. To account
for the distribution of the observational data changing over time, space or
some other covariate, we need generalizations of latent variable models that
explicitly capture this dependency on the covariate. A variety of such
generalizations has been proposed for latent variable models based on the
Dirichlet process. We address dependency on covariates in binary latent
feature models, by introducing a dependent Indian Buffet Process. The model
generates a binary random matrix with an unbounded number of columns for each
value of the covariate. Evolution of the binary matrices over the covariate
set is controlled by a hierarchical Gaussian process model. The choice of
covariance functions controls the dependence structure and exchangeability
properties of the model. We derive a Markov Chain Monte Carlo sampling
algorithm for Bayesian inference, and provide experiments on both synthetic
and real-world data. The experimental results show that explicit modeling of
dependencies significantly improves accuracy of predictions.

R. P. Adams, Zoubin Ghahramani, and Michael I. Jordan.
**Tree-structured
stick breaking for hierarchical data**.
In *Advances in Neural Information Processing Systems 23*. The MIT
Press, 2010.

** Abstract:** Many data are naturally modeled by
an unobserved hierarchical structure. In this paper we propose a flexible
nonparametric prior over unknown data hierarchies. The approach uses nested
stick-breaking processes to allow for trees of unbounded width and depth,
where data can live at any node and are infinitely exchangeable. One can view
our model as providing infinite mixtures where the components have a
dependency structure corresponding to an evolutionary diffusion down a tree.
By using a stick-breaking approach, we can apply Markov chain Monte Carlo
methods based on slice sampling to perform Bayesian inference and simulate
from the posterior distribution on trees. We apply our method to hierarchical
clustering of images and topic modeling of text data.
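
As a rough illustration of the stick-breaking idea mentioned in this abstract, the sketch below draws mixture weights from an ordinary single-level, truncated stick-breaking construction. It is not the nested, tree-structured prior of the paper; the function name, the concentration value and the truncation level are assumptions chosen for the example.

```python
import random

def stick_breaking_weights(alpha, truncation, seed=0):
    """Truncated stick-breaking: repeatedly break off a Beta(1, alpha)
    fraction of whatever remains of a unit-length stick."""
    rng = random.Random(seed)
    remaining = 1.0
    weights = []
    for _ in range(truncation):
        fraction = rng.betavariate(1.0, alpha)  # share of the remaining stick
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    return weights, remaining

# Hypothetical settings: concentration 2.0, first 20 weights only.
weights, leftover = stick_breaking_weights(alpha=2.0, truncation=20)
```

The weights plus the leftover mass always sum to one, so a truncated draw can be used as an approximate mixing distribution.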

Sébastien Bratières, Jurgen Van Gael, Andreas Vlachos, and Zoubin
Ghahramani.
**Scaling the
iHMM: Parallelization versus Hadoop**.
In *Proceedings of the 2010 10th IEEE International Conference on Computer
and Information Technology*, pages 1235-1240, Bradford, UK, 2010. IEEE
Computer Society, doi
10.1109/CIT.2010.223.

** Abstract:** This paper compares
parallel and distributed implementations of an iterative, Gibbs sampling,
machine learning algorithm. Distributed implementations run under Hadoop on
facility computing clouds. The probabilistic model under study is the
infinite HMM (Beal, Ghahramani and Rasmussen,
2002), in which parameters are learnt using an instance blocked Gibbs
sampling, with a step consisting of a dynamic program. We apply this model to
learn part-of-speech tags from newswire text in an unsupervised fashion.
However, our focus here is on runtime performance, as opposed to NLP-relevant
scores, embodied by iteration duration, ease of development, deployment and
debugging.

J. Leskovec, D. Chakrabarti, J. Kleinberg, C. Faloutsos, and Z. Ghahramani.
**Kronecker
graphs: An approach to modeling networks**.
*Journal of Machine Learning Research*, 11(Feb):985-1042, 2010.

** Abstract:** How can we generate realistic networks? In
addition, how can we do so with a mathematically tractable model that allows
for rigorous analysis of network properties? Real networks exhibit a long
list of surprising properties: Heavy tails for the in- and out-degree
distribution, heavy tails for the eigenvalues and eigenvectors, small
diameters, and densification and shrinking diameters over time. Current
network models and generators either fail to match several of the above
properties, are complicated to analyze mathematically, or both. Here we
propose a generative model for networks that is both mathematically tractable
and can generate networks that have all the above mentioned structural
properties. Our main idea here is to use a non-standard matrix operation, the
Kronecker product, to generate graphs which we refer to as "Kronecker
graphs".

First, we show that Kronecker graphs naturally obey common
network properties. In fact, we rigorously prove that they do so. We also
provide empirical evidence showing that Kronecker graphs can effectively
model the structure of real networks.

We then present KRONFIT, a fast and
scalable algorithm for fitting the Kronecker graph generation model to large
real networks. A naive approach to fitting would take super-exponential time.
In contrast, KRONFIT takes linear time, by exploiting the structure of
Kronecker matrix multiplication and by using statistical simulation
techniques. Experiments on a wide range of large real and synthetic networks
show that KRONFIT finds accurate parameters that very well mimic the
properties of target networks. In fact, using just four parameters we can
accurately model several aspects of global network structure. Once fitted,
the model parameters can be used to gain insights about the network
structure, and the resulting synthetic graphs can be used for null-models,
anonymization, extrapolations, and graph summarization.
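
To make the generative idea in this abstract concrete, here is a minimal sketch of sampling a stochastic Kronecker graph: a small initiator matrix of edge probabilities is raised to a Kronecker power, and each edge is then drawn independently. The 2x2 initiator values and all function names are assumptions for illustration. This dense version is quadratic in the number of nodes and does not implement KRONFIT.

```python
import random

def kron(a, b):
    """Kronecker product of two square matrices given as lists of lists."""
    n, m = len(a), len(b)
    return [[a[i // m][j // m] * b[i % m][j % m] for j in range(n * m)]
            for i in range(n * m)]

def kronecker_power(initiator, k):
    """k-th Kronecker power of the initiator matrix (k >= 1)."""
    result = initiator
    for _ in range(k - 1):
        result = kron(result, initiator)
    return result

def sample_graph(initiator, k, seed=0):
    """Directed graph in which edge (i, j) appears with probability P[i][j]."""
    rng = random.Random(seed)
    probs = kronecker_power(initiator, k)
    return [[1 if rng.random() < p else 0 for p in row] for row in probs]

# Hypothetical 2x2 initiator: its four entries are the model parameters.
theta = [[0.9, 0.5], [0.5, 0.1]]
adjacency = sample_graph(theta, k=3)  # 8 x 8 adjacency matrix
```

With a 2x2 initiator and power k the graph has 2^k nodes, which is how four parameters can describe a large network.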

C. Lippert, Z. Ghahramani, and K. Borgwardt.
**Gene function
prediction from synthetic lethality networks via ranking on demand**.
*Bioinformatics*, 26:912-918, 2010.

** Abstract:**
Motivation: Synthetic lethal interactions represent pairs of genes whose
individual mutations are not lethal, while the double mutation of both genes
does incur lethality. Several studies have shown a correlation between
functional similarity of genes and their distances in networks based on
synthetic lethal interactions. However, there is a lack of algorithms for
predicting gene function from synthetic lethality interaction networks.

Results: In this article, we present a novel technique called kernelROD for
gene function prediction from synthetic lethal interaction networks based on
kernel machines. We apply our novel algorithm to Gene Ontology functional
annotation prediction in yeast. Our experiments show that our method leads to
improved gene function prediction compared with state-of-the-art competitors
and that combining genetic and congruence networks leads to a further
improvement in prediction accuracy.

Charalampos Rotsos, Jurgen Van Gael, Andrew W. Moore, and Zoubin Ghahramani.
**Probabilistic
graphical models for semi-supervised traffic classification**.
In *The 6th International Wireless Communications and Mobile Computing
Conference*, pages 752-757, Caen, France, 2010.

** Abstract:** Traffic classification using machine learning continues to be
an active research area. The majority of work in this area uses off-the-shelf
machine learning tools and treats them as black-box classifiers. This
approach turns all the modelling complexity into a feature selection problem.
In this paper, we build a problem-specific solution to the traffic
classification problem by designing a custom probabilistic graphical model.
Graphical models are a modular framework to design classifiers which
incorporate domain-specific knowledge. More specifically, our solution
introduces semi-supervised learning which means we learn from both labelled
and unlabelled traffic flows. We show that our solution performs
competitively compared to previous approaches while using less data and
simpler features.

O. Stegle, K. J. Denby, E. J. Cooke, D. L. Wild, Z. Ghahramani, and K. M.
Borgwardt.
**A
robust Bayesian two-sample test for detecting intervals of differential
gene expression in microarray time series**.
*Journal of Computational Biology*, 17(3):1-13, 2010, doi
10.1089/cmb.2009.0175.

** Abstract:** Understanding the
regulatory mechanisms that are responsible for an organism's response to
environmental change is an important issue in molecular biology. A first and
important step towards this goal is to detect genes whose expression levels
are affected by altered external conditions. A range of methods to test for
differential gene expression, both in static as well as in time-course
experiments, have been proposed. While these tests answer the question
*whether* a gene is differentially expressed, they do not explicitly
address the question *when* a gene is differentially expressed, although
this information may provide insights into the course and causal structure of
regulatory programs. In this article, we propose a two-sample test for
identifying intervals of differential gene expression in microarray time
series. Our approach is based on Gaussian process regression, can deal with
arbitrary numbers of replicates, and is robust with respect to outliers. We
apply our algorithm to study the response of *Arabidopsis thaliana*
genes to an infection by a fungal pathogen using a microarray time series
dataset covering 30,336 gene probes at 24 observed time points. In
classification experiments, our test compares favorably with existing methods
and provides additional insights into time-dependent differential
expression.

R. S. Savage, Z. Ghahramani, J. E. Griffin, B. de la Cruz, and D. L. Wild.
**Discovering
transcriptional modules by Bayesian data integration**.
*Bioinformatics*, 26:i158-i167, 2010.

** Abstract:**
Motivation: We present a method for directly inferring transcriptional
modules (TMs) by integrating gene expression and transcription factor binding
(ChIP-chip) data. Our model extends a hierarchical Dirichlet process mixture
model to allow data fusion on a gene-by-gene basis. This encodes the
intuition that co-expression and co-regulation are not necessarily equivalent
and hence we do not expect all genes to group similarly in both datasets. In
particular, it allows us to identify the subset of genes that share the same
structure of transcriptional modules in both datasets.

Results: We find
that by working on a gene-by-gene basis, our model is able to extract
clusters with greater functional coherence than existing methods. By
combining gene expression and transcription factor binding (ChIP-chip) data
in this way, we are better able to determine the groups of genes that are
most likely to represent underlying TMs.

Availability: If interested in
the code for the work presented in this article, please contact the
authors.

R. Silva, K. A. Heller, Z. Ghahramani, and E. M. Airoldi.
**Ranking
relations using analogies in biological and information networks**.
*Annals of Applied Statistics*, 4(2):615-644, 2010.

** Abstract:** Analogical reasoning depends fundamentally on the ability to
learn and generalize about relations between objects. We develop an approach
to relational learning which, given a set of pairs of objects S = {A(1):B(1),
A(2):B(2), ..., A(N):B(N)}, measures how well other pairs A:B fit in with the
set S. Our work addresses the question: is the relation between objects A and
B analogous to those relations found in S? Such questions are particularly
relevant in information retrieval, where an investigator might want to search
for analogous pairs of objects that match the query set of interest. There
are many ways in which objects can be related, making the task of measuring
analogies very challenging. Our approach combines a similarity measure on
function spaces with Bayesian analysis to produce a ranking. It requires data
containing features of the objects of interest and a link matrix specifying
which relationships exist; no further attributes of such relationships are
necessary. We illustrate the potential of our method on text analysis and
information networks. An application on discovering functional interactions
between pairs of proteins is discussed in detail, where we show that our
approach can work in practice even if a small set of protein pairs is
provided.

Andreas Vlachos, Zoubin Ghahramani, and Ted Briscoe.
**Active learning
for constrained Dirichlet process mixture models**.
In *Proceedings of the 2010 Workshop on Geometrical Models of Natural
Language Semantics*, pages 57-61, Uppsala, Sweden, 2010.

** Abstract:** Recent work applied Dirichlet Process Mixture Models to the
task of verb clustering, incorporating supervision in the form of must-links
and cannot-links constraints between instances. In this work, we introduce an
active learning approach for constraint selection employing uncertainty-based
sampling. We achieve substantial improvements over random selection on two
datasets.

Andrew Gordon Wilson and Zoubin Ghahramani.
**Copula
processes**.
In *Advances in Neural Information Processing Systems 23*, 2010.
Spotlight.

** Abstract:** We define a copula process which
describes the dependencies between arbitrarily many random variables
independently of their marginal distributions. As an example, we develop a
stochastic volatility model, Gaussian Copula Process Volatility (GCPV), to
predict the latent standard deviations of a sequence of random variables. To
make predictions we use Bayesian inference, with the Laplace approximation,
and with Markov chain Monte Carlo as an alternative. We find our model can
outperform GARCH on simulated and financial data. And unlike GARCH, GCPV can
easily handle missing data, incorporate covariates other than time, and model
a rich class of covariance structures.

** Comment:** Supplementary
Material, slides.

Finale Doshi-Velez, David Knowles, Shakir Mohamed, and Zoubin Ghahramani.
**Large scale
non-parametric inference: Data parallelisation in the Indian buffet
process**.
In *Advances in Neural Information Processing Systems 22*, pages
1294-1302, Cambridge, MA, USA, December 2009. The MIT Press.

** Abstract:** Nonparametric Bayesian models provide a framework
for flexible probabilistic modelling of complex datasets. Unfortunately, the
high-dimensional averages required for Bayesian methods can be slow,
especially with the unbounded representations used by nonparametric models.
We address the challenge of scaling Bayesian inference to the increasingly
large datasets found in real-world applications. We focus on parallelisation
of inference in the Indian Buffet Process (IBP), which allows data points to
have an unbounded number of sparse latent features. Our novel MCMC sampler
divides a large data set between multiple processors and uses message passing
to compute the global likelihoods and posteriors. This algorithm, the first
parallel inference scheme for IBP-based models, scales to datasets orders of
magnitude larger than have previously been possible.

Shakir Mohamed, Katherine A. Heller, and Zoubin Ghahramani.
**Bayesian
exponential family PCA**.
In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, *Advances in
Neural Information Processing Systems 21*, pages 1089-1096, Cambridge,
MA, USA, December 2009. The MIT Press.

** Abstract:**
Principal Components Analysis (PCA) has become established as one of the key
tools for dimensionality reduction when dealing with real valued data.
Approaches such as exponential family PCA and non-negative matrix
factorisation have successfully extended PCA to non-Gaussian data types, but
these techniques fail to take advantage of Bayesian inference and can suffer
from problems of overfitting and poor generalisation. This paper presents a
fully probabilistic approach to PCA, which is generalised to the exponential
family, based on Hybrid Monte Carlo sampling. We describe the model which is
based on a factorisation of the observed data matrix, and show performance of
the model on both synthetic and real data.

** Comment:** Spotlight.

O. Stegle, K. Denby, S. McHattie, A. Meade, D. Wild, Z. Ghahramani, and
K. Borgwardt.
**Discovering
temporal patterns of differential gene expression in microarray time
series**.
In *German Conference on Bioinformatics*, pages 133-142, Halle,
Germany, September 2009.

** Abstract:** A wealth of time
series of microarray measurements have become available over recent years.
Several two-sample tests for detecting differential gene expression in these
time series have been defined, but they can only answer the question
*whether* a gene is differentially expressed across the whole time
series, not *in which intervals* it is differentially expressed. In this
article, we propose a Gaussian process based approach for studying these
dynamics of differential gene expression. In experiments on *Arabidopsis
thaliana* gene expression levels, our novel technique helps us to uncover
that the family of WRKY transcription factors appears to be involved in the
early response to infection by a fungal pathogen.

R. Savage, K. A. Heller, Y. Xu, Zoubin Ghahramani, W. Truman, M. Grant,
K. Denby, and D. L. Wild.
**R/BHC: fast
Bayesian hierarchical clustering for microarray data**.
*BMC Bioinformatics*, 10(242):1-9, August 2009, doi
10.1186/1471-2105-10-242.

** Abstract:** Background:
Although the use of clustering methods has rapidly become one of the standard
computational approaches in the literature of microarray gene expression data
analysis, little attention has been paid to uncertainty in the results
obtained.

Results: We present an R/Bioconductor port of a fast novel
algorithm for Bayesian agglomerative hierarchical clustering and demonstrate
its use in clustering gene expression microarray data. The method performs
bottom-up hierarchical clustering, using a Dirichlet Process (infinite
mixture) to model uncertainty in the data and Bayesian model selection to
decide at each step which clusters to merge.

Conclusion: Biologically
plausible results are presented from a well studied data set: expression
profiles of *A. thaliana* subjected to a variety of biotic and abiotic
stresses. Our method avoids several limitations of traditional methods, for
example how many clusters there should be and how to choose a principled
distance metric.

J. Van Gael, A. Vlachos, and Z. Ghahramani.
**The infinite
HMM for unsupervised PoS tagging**.
In *Proceedings of the 2009 Conference on Empirical Methods in Natural
Language Processing (EMNLP)*, pages 678-687, Singapore, August 2009.
Association for Computational Linguistics.

** Abstract:** We
extend previous work on fully unsupervised part-of-speech tagging. Using a
non-parametric version of the HMM, called the infinite HMM (iHMM), we address
the problem of choosing the number of hidden states in unsupervised Markov
models for PoS tagging. We experiment with two non-parametric priors, the
Dirichlet and Pitman-Yor processes, on the Wall Street Journal dataset using
a parallelized implementation of an iHMM inference algorithm. We evaluate the
results with a variety of clustering evaluation metrics and achieve
equivalent or better performances than previously reported. Building on this
promising result we evaluate the output of the unsupervised PoS tagger as a
direct replacement for the output of a fully supervised PoS tagger for the
task of shallow parsing and compare the two evaluations.

R. Adams and Zoubin Ghahramani.
**Archipelago:
nonparametric Bayesian semi-supervised learning**.
In Léon Bottou and Michael Littman, editors, *26th International
Conference on Machine Learning*, pages 1-8, Montréal, QC, Canada,
June 2009. Omnipress.

** Abstract:** Semi-supervised learning
(SSL) is classification where additional unlabeled data can be used to
improve accuracy. Generative approaches are appealing in this situation, as a
model of the data's probability density can assist in identifying clusters.
Nonparametric Bayesian methods, while ideal in theory due to their principled
motivations, have been difficult to apply to SSL in practice. We present a
nonparametric Bayesian method that uses Gaussian processes for the generative
model, avoiding many of the problems associated with Dirichlet process
mixture models. Our model is fully generative and we take advantage of recent
advances in Markov chain Monte Carlo algorithms to provide a practical
inference method. Our method compares favorably to competing approaches on
synthetic and real-world multi-class data.

** Comment:** This paper was awarded Honourable Mention for
Best Paper at ICML 2009.

F. Doshi-Velez and Z. Ghahramani.
**Correlated
non-parametric latent feature models**.
In *Conference on Uncertainty in Artificial Intelligence (UAI 2009)*,
pages 143-150, Montréal, QC, Canada, June 2009. AUAI Press.

** Abstract:** We are often interested in explaining data
through a set of hidden factors or features. To allow for an unknown number
of such hidden features, one can use the IBP: a non-parametric latent feature
model that does not bound the number of active features in a dataset.
However, the IBP assumes that all latent features are uncorrelated, making it
inadequate for many real-world problems. We introduce a framework for
correlated non-parametric feature models, generalising the IBP. We use this
framework to generate several specific models and demonstrate applications on
real-world datasets.
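
For readers unfamiliar with the IBP referred to in this abstract, the following sketch samples a binary feature matrix from the standard (uncorrelated) Indian buffet process using its usual restaurant construction. It illustrates only the baseline prior, not the correlated models proposed in the paper; the function names and parameter values are assumptions for the example.

```python
import math
import random

def sample_poisson(rng, lam):
    """Poisson draw by Knuth's multiplication method (fine for small lam)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def sample_ibp(num_customers, alpha, seed=0):
    """Return a binary matrix: rows are customers, columns are dishes."""
    rng = random.Random(seed)
    dish_counts = []  # dish_counts[k] = customers so far who took dish k
    rows = []
    for n in range(1, num_customers + 1):
        # Take each existing dish with probability (times taken) / n.
        row = [1 if rng.random() < m / n else 0 for m in dish_counts]
        # Then try Poisson(alpha / n) brand-new dishes.
        new_dishes = sample_poisson(rng, alpha / n)
        row.extend([1] * new_dishes)
        dish_counts = [m + z for m, z in zip(dish_counts, row)] + [1] * new_dishes
        rows.append(row)
    width = len(dish_counts)
    return [row + [0] * (width - len(row)) for row in rows]  # pad to full width

# Hypothetical settings: 10 observations, concentration alpha = 2.0.
features = sample_ibp(num_customers=10, alpha=2.0)
```

The number of columns is not fixed in advance, which is the sense in which the model does not bound the number of active features.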

Finale Doshi-Velez and Zoubin Ghahramani.
**Accelerated Gibbs
sampling for the Indian buffet process**.
In Léon Bottou and Michael Littman, editors, *26th International
Conference on Machine Learning*, pages 273-280, Montréal, QC,
Canada, June 2009. Omnipress.

** Abstract:** We often seek to
identify co-occurring hidden features in a set of observations. The Indian
Buffet Process (IBP) provides a non-parametric prior on the features present
in each observation, but current inference techniques for the IBP often scale
poorly. The collapsed Gibbs sampler for the IBP has a running time cubic in
the number of observations, and the uncollapsed Gibbs sampler, while linear,
is often slow to mix. We present a new linear-time collapsed Gibbs sampler
for conjugate likelihood models and demonstrate its efficacy on large
real-world datasets.

R. Silva and Z. Ghahramani.
**The hidden life of
latent variables: Bayesian learning with mixed graph models**.
*Journal of Machine Learning Research*, 10:1187-1238, June 2009.

** Abstract:** Directed acyclic graphs (DAGs) have been widely
used as a representation of conditional independence in machine learning and
statistics. Moreover, hidden or latent variables are often an important
component of graphical models. However, DAG models suffer from an important
limitation: the family of DAGs is not closed under marginalization of hidden
variables. This means that in general we cannot use a DAG to represent the
independencies over a subset of variables in a larger DAG. Directed mixed
graphs (DMGs) are a representation that includes DAGs as a special case, and
overcomes this limitation. This paper introduces algorithms for performing
Bayesian inference in Gaussian and probit DMG models. An important
requirement for inference is the specification of the distribution over
parameters of the models. We introduce a new distribution for covariance
matrices of Gaussian DMGs. We discuss and illustrate how several Bayesian
machine learning tasks can benefit from the principle presented here: the
power to model dependencies that are generated from hidden variables, but
without necessarily modeling such variables explicitly.

W. Chu and Z. Ghahramani.
**Probabilistic models
for incomplete multi-dimensional arrays**.
In D. van Dyk and M. Welling, editors, *12th International Conference on
Artificial Intelligence and Statistics*, volume 5, pages 89-96,
Clearwater Beach, FL, USA, April 2009. Microtome Publishing (paper), Journal
of Machine Learning Research (online).
ISSN 1938-7228.

** Abstract:** In multiway data, each sample is
measured by multiple sets of correlated attributes. We develop a
probabilistic framework for modeling structural dependency from partially
observed multi-dimensional array data, known as pTucker. Latent components
associated with individual array dimensions are jointly retrieved while the
core tensor is integrated out. The resulting algorithm is capable of handling
large-scale data sets. We verify the usefulness of this approach by comparing
against classical models on applications to modeling amino acid fluorescence,
collaborative filtering and a number of benchmark multiway array data.

Frederik Eaton and Zoubin Ghahramani.
**Choosing a variable
to clamp: Approximate inference using conditioned belief
propagation**.
In D. van Dyk and M. Welling, editors, *12th International Conference on
Artificial Intelligence and Statistics*, volume 5, pages 145-152,
Clearwater Beach, FL, USA, April 2009. Journal of Machine Learning
Research.

** Abstract:** In this paper we propose an algorithm
for approximate inference on graphical models based on belief propagation
(BP). Our algorithm is an approximate version of Cutset Conditioning, in
which a subset of variables is instantiated to make the rest of the graph
singly connected. We relax the constraint of single-connectedness, and select
variables one at a time for conditioning, running belief propagation after
each selection. We consider the problem of determining the best variable to
clamp at each level of recursion, and propose a fast heuristic which applies
back-propagation to the BP updates. We demonstrate that the heuristic
performs better than selecting variables at random, and give experimental
results which show that it performs competitively with existing approximate
inference algorithms.

** Comment:** Code (in C++
based on libDAI).

C. Lippert, O. Stegle, Z. Ghahramani, and K. Borgwardt.
**A kernel
method for unsupervised structured network inference**.
In D. van Dyk and M. Welling, editors, *12th International Conference on
Artificial Intelligence and Statistics*, volume 5, pages 368-375,
Clearwater Beach, FL, USA, April 2009. Journal of Machine Learning Research.
ISSN: 1938-7228.

** Abstract:** Network inference is the problem
of inferring edges between a set of real-world objects, for instance,
interactions between pairs of proteins in bioinformatics. Current
kernel-based approaches to this problem share a set of common features: (i)
they are supervised and hence require labeled training data; (ii) edges in
the network are treated as mutually independent and hence topological
properties are largely ignored; (iii) they lack a statistical interpretation.
We argue that these common assumptions are often undesirable for network
inference, and propose (i) an unsupervised kernel method (ii) that takes the
global structure of the network into account and (iii) is statistically
motivated. We show that our approach can explain commonly used heuristics in
statistical terms. In experiments on social networks, different variants of
our method demonstrate appealing predictive performance.

R. Silva and Z. Ghahramani.
**Factorial mixture
of Gaussians and the marginal independence model**.
In *12th International Conference on Artificial Intelligence and
Statistics*, volume 5, pages 520-527, Clearwater Beach, FL, USA,
April 2009. Journal of Machine Learning Research.
ISSN: 1938-7228.

** Abstract:** Marginal independence
constraints play an important role in learning with graphical models. One way
of parameterizing a model of marginal independencies is by building a latent
variable model where two independent observed variables have no common latent
source. In sparse domains, however, it might be advantageous to model the
marginal observed distribution directly, without explicitly including latent
variables in the model. There have been recent advances in Gaussian and
binary models of marginal independence, but no models with non-linear
dependencies between continuous variables have been proposed so far. In this
paper, we describe how to generalize the Gaussian model of marginal
independencies based on mixtures, and how to learn parameters. This requires
a non-standard parameterization and raises difficult non-linear optimization
issues.

** Comment:** Code at http://www.homepages.ucl.ac.uk/~ucgtrbd/code/fmog-version0.zip

T. Stepleton, Z. Ghahramani, G. Gordon, and T.-S. Lee.
**The block
diagonal infinite hidden Markov model**.
In D. van Dyk and M. Welling, editors, *12th International Conference on
Artificial Intelligence and Statistics*, volume 5, pages 552-559,
Clearwater Beach, FL, USA, April 2009. Microtome Publishing (paper), Journal
of Machine Learning Research (online).
ISSN 1938-7228.

** Abstract:** The Infinite Hidden Markov Model
(IHMM) extends hidden Markov models to have a countably infinite number of
hidden states (Beal et al., 2002; Teh et al.,
2006). We present a generalization of this framework that introduces nearly
block-diagonal structure in the transitions between the hidden states, where
blocks correspond to "sub-behaviors" exhibited by data sequences. In
identifying such structure, the model classifies, or partitions, sequence
data according to these sub-behaviors in an unsupervised way. We present an
application of this model to artificial data, a video gesture classification
task, and a musical theme labeling task, and show that components of the
model can also be applied to graph segmentation.

Yang Xu, Katherine A. Heller, and Zoubin Ghahramani.
**Tree-based
inference for Dirichlet process mixtures**.
In D. van Dyk and M. Welling, editors, *12th International Conference on
Artificial Intelligence and Statistics*, volume 5, pages 623-630,
Clearwater Beach, FL, USA, April 2009. Microtome Publishing (paper), Journal
of Machine Learning Research (online).
ISSN 1938-7228.

** Abstract:** The Dirichlet process mixture
(DPM) is a widely used model for clustering and for general nonparametric
Bayesian density estimation. Unfortunately, like in many statistical models,
exact inference in a DPM is intractable, and approximate methods are needed
to perform efficient inference. While most attention in the literature has
been placed on Markov chain Monte Carlo (MCMC) [1, 2, 3], variational
Bayesian (VB) [4] and collapsed variational methods [5], [6] recently
introduced a novel class of approximation for DPMs based on Bayesian
hierarchical clustering (BHC). These tree-based combinatorial approximations
efficiently sum over exponentially many ways of partitioning the data and
offer a novel lower bound on the marginal likelihood of the DPM [6]. In this
paper we make the following contributions: (1) We show empirically that the
BHC lower bounds are substantially tighter than the bounds given by VB [4]
and by collapsed variational methods [5] on synthetic and real datasets. (2)
We also show that BHC offers a more accurate predictive performance on these
datasets. (3) We further improve the tree-based lower bounds with an
algorithm that efficiently sums contributions from alternative trees. (4) We
present a fast approximate method for BHC. Our results suggest that our
combinatorial approximate inference methods and lower bounds may be useful
not only in DPMs but in other models as well.

A. Vlachos, A. Korhonen, and Z. Ghahramani.
**Unsupervised and
constrained Dirichlet process mixture models for verb clustering**.
In *4th Workshop on Statistical Machine Translation, EACL '09*, Athens,
Greece, March 2009.

** Abstract:** In this work, we apply
Dirichlet Process Mixture Models (DPMMs) to a learning task in natural
language processing (NLP): lexical-semantic verb clustering. We thoroughly
evaluate a method of guiding DPMMs towards a particular clustering solution
using pairwise constraints. The quantitative and qualitative evaluation
performed highlights the benefits of both standard and constrained DPMMs
compared to previously used approaches. In addition, it sheds light on the
use of evaluation measures and their practical application.

Karsten M. Borgwardt and Zoubin Ghahramani.
**Bayesian two-sample
tests**.
*arXiv*, abs/0906.4032, 2009.

** Abstract:** In this
paper, we present two classes of Bayesian approaches to the two-sample
problem. Our first class of methods extends the Bayesian t-test to include
all parametric models in the exponential family and their conjugate priors.
Our second class of methods uses Dirichlet process mixtures (DPM) of such
conjugate-exponential distributions as flexible nonparametric priors over the
unknown distributions.

Finale Doshi-Velez and Zoubin Ghahramani.
**Accelerated
sampling for the Indian buffet process**.
In Andrea Pohoreckyj Danyluk, Léon Bottou, and Michael L. Littman, editors,
*ICML*, volume 382 of *ACM International Conference Proceeding
Series*, page 35. ACM, 2009.

** Abstract:** We often
seek to identify co-occurring hidden features in a set of observations. The
Indian Buffet Process (IBP) provides a nonparametric prior on the features
present in each observation, but current inference techniques for the IBP
often scale poorly. The collapsed Gibbs sampler for the IBP has a running
time cubic in the number of observations, and the uncollapsed Gibbs sampler,
while linear, is often slow to mix. We present a new linear-time collapsed
Gibbs sampler for conjugate likelihood models and demonstrate its efficacy on
large real-world datasets.

Carl Edward Rasmussen, Bernhard J. de la Cruz, Zoubin Ghahramani, and David L.
Wild.
**Modeling and visualizing
uncertainty in gene expression clusters using Dirichlet process
mixtures**.
*IEEE/ACM Transactions on Computational Biology and Bioinformatics*,
6(4):615-628, 2009, doi
10.1109/TCBB.2007.70269.

** Abstract:** Although the use
of clustering methods has rapidly become one of the standard computational
approaches in the literature of microarray gene expression data, little
attention has been paid to uncertainty in the results obtained. Dirichlet
process mixture (DPM) models provide a nonparametric Bayesian alternative to
the bootstrap approach to modeling uncertainty in gene expression clustering.
Most previously published applications of Bayesian model-based clustering
methods have been to short time series data. In this paper, we present a case
study of the application of nonparametric Bayesian clustering methods to the
clustering of high-dimensional nontime series gene expression data using full
Gaussian covariances. We use the probability that two genes belong to the
same cluster in a DPM model as a measure of the similarity of these gene
expression profiles. Conversely, this probability can be used to define a
dissimilarity measure, which, for the purposes of visualization, can be input
to one of the standard linkage algorithms used for hierarchical clustering.
Biologically plausible results are obtained from the Rosetta compendium of
expression profiles which extend previously published cluster analyses of
this data.
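The similarity measure described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes posterior samples of cluster labels are already available (the `assignment_samples` argument is a hypothetical input), and it takes one minus the co-clustering probability as the dissimilarity, which is one natural choice rather than something the abstract specifies.

```python
def coclustering_probabilities(assignment_samples):
    """Estimate P(item i and item j share a cluster) from label samples.

    assignment_samples: list of label lists, one per posterior sample
    (hypothetical input; each inner list gives a cluster label per item).
    """
    num_samples = len(assignment_samples)
    num_items = len(assignment_samples[0])
    counts = [[0] * num_items for _ in range(num_items)]
    for labels in assignment_samples:
        for i in range(num_items):
            for j in range(num_items):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    # Relative frequency of co-clustering across the samples.
    return [[c / num_samples for c in row] for row in counts]


def dissimilarity_matrix(probabilities):
    """Assumed dissimilarity: one minus the co-clustering probability."""
    return [[1.0 - p for p in row] for row in probabilities]
```

The resulting matrix could then be passed to any standard linkage routine for hierarchical clustering.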

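O. Stegle, K. Denby, David L. Wild, Zoubin Ghahramani, and Karsten Borgwardt.
**A robust Bayesian two-sample test for detecting intervals of differential
gene expression in microarray time series**.
In *13th Annual International Conference on Research in Computational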
Molecular Biology (RECOMB 2009)*, volume 5541 of *Lecture Notes in
Bioinformatics*, pages 201-216, Tucson, AZ, USA, 2009. Springer-Verlag,
doi
10.1007/978-3-642-02008-7_14.

** Abstract:** Understanding
the regulatory mechanisms that are responsible for an organism's response to
environmental changes is an important question in molecular biology. A first
and important step towards this goal is to detect genes whose expression
levels are affected by altered external conditions. A range of methods to
test for differential gene expression, both in static as well as in
time-course experiments, have been proposed. While these tests answer the
question *whether* a gene is differentially expressed, they do not
explicitly address the question *when* a gene is differentially
expressed, although this information may provide insights into the course and
causal structure of regulatory programs. In this article, we propose a
two-sample test for identifying *intervals* of differential gene
expression in microarray time series. Our approach is based on Gaussian
process regression, can deal with arbitrary numbers of replicates and is
robust with respect to outliers. We apply our algorithm to study the response
of *Arabidopsis thaliana* genes to an infection by a fungal pathogen
using a microarray time series dataset covering 30,336 gene probes at 24 time
points. In classification experiments our test compares favorably with
existing methods and provides additional insights into time-dependent
differential expression.

Andreas Vlachos, Anna Korhonen, and Zoubin Ghahramani.
**Unsupervised and
constrained Dirichlet process mixture models for verb clustering**.
In *Proceedings of the workshop on geometrical models of natural language
semantics*, pages 74-82. Association for Computational Linguistics,
2009.

** Abstract:** In this work, we apply Dirichlet Process
Mixture Models (DPMMs) to a learning task in natural language processing
(NLP): lexical-semantic verb clustering. We thoroughly evaluate a method of
guiding DPMMs towards a particular clustering solution using pairwise
constraints. The quantitative and qualitative evaluation performed
highlights the benefits of both standard and constrained DPMMs compared to
previously used approaches. In addition, it sheds light on the use of
evaluation measures and their practical application.

C. Hübler, K. Borgwardt, H.-P. Kriegel, and Z. Ghahramani.
**Metropolis
algorithms for representative subgraph sampling**.
In *Proceedings of 8th IEEE International Conference on Data Mining (ICDM
2008)*, pages 283-292, Pisa, Italy, December 2008. IEEE.
ISSN: 1550-4786.

** Abstract:** While data mining in
chemoinformatics studied graph data with dozens of nodes, systems biology and
the Internet are now generating graph data with thousands and millions of
nodes. Hence data mining faces the algorithmic challenge of coping with this
significant increase in graph size: Classic algorithms for data analysis are
often too expensive and too slow on large graphs.

While one strategy to
overcome this problem is to design novel efficient algorithms, the other is
to 'reduce' the size of the large graph by sampling. This is the scope of
this paper: We will present novel Metropolis algorithms for sampling a
'representative' small subgraph from the original large graph, with
'representative' describing the requirement that the sample shall preserve
crucial graph properties of the original graph. In our experiments, we
improve over the pioneering work of Leskovec and Faloutsos (KDD 2006), by
producing representative subgraph samples that are both smaller and of higher
quality than those produced by other methods from the literature.

H. Kim and Zoubin Ghahramani.
**Outlier robust
Gaussian process classification**.
In Niels da Vitoria Lobo et al., editors, *Structural, Syntactic and Statistical
Pattern Recognition*, volume 5342 of *Lecture Notes in Computer
Science (LNCS)*, pages 896-905, Berlin, Germany, December 2008. Springer
Berlin / Heidelberg.

** Abstract:** Gaussian process
classifiers (GPCs) are a fully statistical model for kernel classification.
We present a form of GPC which is robust to labeling errors in the data set.
This model allows label noise not only near the class boundaries, but also
far from the class boundaries which can result from mistakes in labelling or
gross errors in measuring the input features. We derive an outlier robust
algorithm for training this model which alternates iterations based on the EP
approximation and hyperparameter updates until convergence. We show the
usefulness of the proposed algorithm, combined with a model selection method,
through simulation results.

R. Silva, W. Chu, and Zoubin Ghahramani.
**Hidden common
cause relations in relational learning**.
In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, *Advances in
Neural Information Processing Systems 20*, pages 1345-1352, Cambridge,
MA, USA, December 2008. The MIT Press.

** Abstract:** When
predicting class labels for objects within a relational database, it is often
helpful to consider a model for relationships: this allows for information
between class labels to be shared and to improve prediction performance.
However, there are different ways by which objects can be related within a
relational database. One traditional way corresponds to a Markov network
structure: each existing relation is represented by an undirected edge. This
encodes that, conditioned on input features, each object label is independent
of other object labels given its neighbors in the graph. However, there is no
reason why Markov networks should be the only representation of choice for
symmetric dependence structures. Here we discuss the case when relationships
are postulated to exist due to *hidden common causes*. We discuss how
the resulting graphical model differs from Markov networks, and how it
describes different types of real-world relational processes. A Bayesian
nonparametric classification model is built upon this graphical
representation and evaluated with several empirical studies.

** Comment:** Code at http://www.homepages.ucl.ac.uk/~ucgtrbd/code/xgp

J.M. Sung, Z. Ghahramani, and S.Y. Bang.
**Second-order
latent space variational Bayes for approximate Bayesian
inference**.
*IEEE Signal Processing Letters*, 15:918-921, December 2008.

** Abstract:** In this letter, we consider a variational
approximate Bayesian inference framework, latent-space variational Bayes
(LSVB), in the general context of conjugate-exponential family models with
latent variables. In the LSVB approach, we integrate out model parameters in
an exact way and then perform the variational inference over only the latent
variables. It can be shown that LSVB can achieve better estimates of the
model evidence as well as the distribution over the latent variables than the
popular variational Bayesian expectation-maximization (VBEM). However, the
distribution over the latent variables in LSVB has to be approximated in
practice. As an approximate implementation of LSVB, we propose a second-order
LSVB (SoLSVB) method. In particular, VBEM can be derived as a special case of
a first-order approximation in LSVB. SoLSVB can capture higher order
statistics neglected in VBEM and can therefore achieve a better
approximation. Examples of Gaussian mixture models are used to illustrate the
comparison between our method and VBEM, demonstrating the improvement.

J. Van Gael, Y.W. Teh, and Z. Ghahramani.
**The infinite
factorial hidden Markov model**.
In D. Koller, D. Schuurmans, L. Bottou, and Y. Bengio, editors, *Advances in
Neural Information Processing Systems 21*, volume 21, pages
1697-1704, Cambridge, MA, USA, December 2008. The MIT Press.

** Abstract:** The infinite factorial hidden Markov model is a
non-parametric extension of the factorial hidden Markov model. Our model
defines a probability distribution over an infinite number of independent
binary hidden Markov chains which together produce an observable sequence of
random variables. Central to our model is a new type of non-parametric prior
distribution inspired by the Indian Buffet Process which we call the
*Indian Buffet Markov Process*.

J. Zhang, Z. Ghahramani, and Y. Yang.
**Flexible latent
variable models for multi-task learning**.
*Machine Learning*, 73(3):221-242, December 2008.

**
Abstract:** Given multiple prediction problems such as regression and
classification, we are interested in a joint inference framework which can
effectively borrow information among tasks to improve the prediction
accuracy, especially when the number of training examples per problem is
small. In this paper we propose a probabilistic framework which can support a
set of latent variable models for different multi-task learning scenarios. We
show that the framework is a generalization of standard learning methods for
single prediction problems and it can effectively model the shared structure
among different prediction tasks. Furthermore, we present efficient
algorithms for the empirical Bayes method as well as point estimation. Our
experiments on both simulated datasets and real world classification datasets
show the effectiveness of the proposed models in two evaluation settings:
standard multi-task learning setting and transfer learning setting.

J.M. Sung, Z. Ghahramani, and S.Y. Bang.
**Latent space
variational Bayes**.
*IEEE Transactions on Pattern Analysis and Machine Intelligence*,
30(12):2236-2242, November 2008.

** Abstract:** Variational
Bayesian Expectation-Maximization (VBEM), an approximate inference method for
probabilistic models based on factorizing over latent variables and model
parameters, has been a standard technique for practical Bayesian inference.
In this paper, we introduce a more general approximate inference framework
for conjugate-exponential family models, which we call Latent-Space
Variational Bayes (LSVB). In this approach, we integrate out the model
parameters in an exact way, leaving only the latent variables. It can be
shown that the LSVB approach gives better estimates of the model evidence as
well as the distribution over the latent variables than the VBEM approach,
but, in practice, the distribution over the latent variables has to be
approximated. As a practical implementation, we present a First-order LSVB
(FoLSVB) algorithm to approximate the distribution over the latent variables.
From this approximate distribution, one can also estimate the model evidence
and the posterior over the model parameters. The FoLSVB algorithm is directly
comparable to the VBEM algorithm and has the same computational complexity.
We discuss how LSVB generalizes the recently proposed collapsed variational
methods to general conjugate-exponential families. Examples based on mixtures
of Gaussians and mixtures of Bernoullis with synthetic and real-world data
sets are used to illustrate some advantages of our method over VBEM.

Katherine A. Heller, Sinead Williamson, and Zoubin Ghahramani.
**Statistical
models for partial membership**.
In Andrew McCallum and Sam Roweis, editors, *25th International Conference
on Machine Learning*, pages 392-399, Helsinki, Finland, July 2008.
Omnipress.

** Abstract:** We present a principled Bayesian
framework for modeling partial memberships of data points to clusters. Unlike
a standard mixture model which assumes that each data point belongs to one
and only one mixture component, or cluster, a partial membership model allows
data points to have fractional membership in multiple clusters. Algorithms
which assign data points partial memberships to clusters can be useful for
tasks such as clustering genes based on microarray data (Gasch & Eisen,
2002). Our Bayesian Partial Membership Model (BPM) uses exponential family
distributions to model each cluster, and a product of these distributions,
with weighted parameters, to model each datapoint. Here the weights
correspond to the degree to which the datapoint belongs to each cluster. All
parameters in the BPM are continuous, so we can use Hybrid Monte Carlo to
perform inference and learning. We discuss relationships between the BPM and
Latent Dirichlet Allocation, Mixed Membership models, Exponential Family PCA,
and fuzzy clustering. Lastly, we show some experimental results and discuss
nonparametric extensions to our model.

A. Vlachos, Z. Ghahramani, and A. Korhonen.
**Dirichlet
process mixture models for verb clustering**.
In Guillaume Bouchard, Hal Daumé III, Marc Dymetman, and Yee Whye Teh,
editors, *ICML Workshop on Prior Knowledge for Text and Language
Processing*, pages 43-48, Helsinki, Finland, July 2008.

**
Abstract:** In this work we apply Dirichlet Process Mixture Models to a
learning task in natural language processing (NLP): lexical-semantic verb
clustering. We assess the performance on a dataset based on Levin's (1993)
verb classes using the recently introduced V-measure metric. In addition, we
present a method to add human supervision to the model in order to influence the
solution with respect to some prior knowledge. The quantitative evaluation
performed highlights the benefits of the chosen method compared to previously
used clustering approaches.

Hyun-Chul Kim and Zoubin Ghahramani.
**Outlier robust
Gaussian process classification**.
In Niels da Vitoria Lobo, Takis Kasparis, Fabio Roli, James Tin-Yau Kwok,
Michael Georgiopoulos, Georgios C. Anagnostopoulos, and Marco Loog, editors,
*SSPR/SPR*, volume 5342 of *Lecture Notes in Computer Science*,
pages 896-905. Springer, 2008.

** Abstract:** Gaussian
process classifiers (GPCs) are a fully statistical model for kernel
classification. We present a form of GPC which is robust to labeling errors
in the data set. This model allows label noise not only near the class
boundaries, but also far from the class boundaries which can result from
mistakes in labelling or gross errors in measuring the input features. We
derive an outlier robust algorithm for training this model which alternates
iterations based on the EP approximation and hyperparameter updates until
convergence. We show the usefulness of the proposed algorithm, combined with a
model selection method, through simulation results.

Andreas Vlachos, Zoubin Ghahramani, and Anna Korhonen.
**Dirichlet
process mixture models for verb clustering**.
In *Proceedings of the ICML workshop on Prior Knowledge for Text and
Language*, 2008.

** Abstract:** In this work we apply
Dirichlet Process Mixture Models to a learning task in natural language
processing (NLP): lexical-semantic verb clustering. We assess the performance
on a dataset based on Levin’s (1993) verb classes using the recently
introduced V-measure metric. In addition, we present a method to add human
supervision to the model in order to influence the solution with respect
to some prior knowledge. The quantitative evaluation performed highlights the
benefits of the chosen method compared to previously used clustering
approaches.

Jurgen Van Gael, Yunus Saatçi, Yee-Whye Teh, and Zoubin Ghahramani.
**Beam sampling
for the infinite hidden Markov model**.
In *25th International Conference on Machine Learning*, volume 25,
pages 1088-1095, Helsinki, Finland, 2008. Association for Computing
Machinery.

** Abstract:** The infinite hidden Markov model is
a non-parametric extension of the widely used hidden Markov model. Our paper
introduces a new inference algorithm for the infinite Hidden Markov model
called beam sampling. Beam sampling combines slice sampling, which limits the
number of states considered at each time step to a finite number, with
dynamic programming, which samples whole state trajectories efficiently. Our
algorithm typically outperforms the Gibbs sampler and is more robust. We
present applications of iHMM inference using the beam sampler on changepoint
detection and text prediction problems.

Sinead Williamson and Zoubin Ghahramani.
**Probabilistic models
for data combination in recommender systems**.
In *Learning from Multiple Sources Workshop, NIPS Conference*, Whistler,
Canada, 2008.

W. Chu, V. Sindhwani, Z. Ghahramani, and S. Keerthi.
**Relational
learning with Gaussian processes**.
In B. Schölkopf, J. Platt, and T. Hofmann, editors, *Advances in Neural
Information Processing Systems 19*, volume 19 of *Bradford
Books*, pages 289-296, Cambridge, MA, USA, September 2007. The MIT
Press.
Online contents gives pages 314-321, and 289-296 on pdf of contents.

** Abstract:** Correlation between instances is often modelled
via a kernel function using input attributes of the instances. Relational
knowledge can further reveal additional pairwise correlations between
variables of interest. In this paper, we develop a class of models which
incorporates both reciprocal relational information and input attributes
using Gaussian process techniques. This approach provides a novel
non-parametric Bayesian framework with a data-dependent prior for supervised
learning tasks. We also apply this framework to semi-supervised learning.
Experimental results on several real world data sets verify the usefulness of
this algorithm.

David Knowles and Zoubin Ghahramani.
**Infinite sparse
factor analysis and infinite independent components analysis**.
In *7th International Conference on Independent Component Analysis and
Signal Separation*, pages 381-388, London, UK, September 2007. Springer,
doi
10.1007/978-3-540-74494-8_48.

** Abstract:** A
nonparametric Bayesian extension of Independent Components Analysis (ICA) is
proposed where observed data Y is modelled as a linear superposition, G, of a
potentially infinite number of hidden sources, X. Whether a given source is
active for a specific data point is specified by an infinite binary matrix,
Z. The resulting sparse representation allows increased data reduction
compared to standard ICA. We define a prior on Z using the Indian Buffet
Process (IBP). We describe four variants of the model, with Gaussian or
Laplacian priors on X and the one or two-parameter IBPs. We demonstrate
Bayesian inference under these models using a Markov chain Monte Carlo (MCMC)
algorithm on synthetic and gene expression data and compare to standard ICA
algorithms.

E. Meeds, Z. Ghahramani, R. Neal, and S.T. Roweis.
**Modelling
dyadic data with binary latent factors**.
In B. Schölkopf, J. Platt, and T. Hofmann, editors, *Advances in Neural
Information Processing Systems 19*, Bradford Books, pages 977-984,
Cambridge, MA, USA, September 2007. The MIT Press.
Online contents gives pages 1002-1009, and 977-984 on pdf contents.

** Abstract:** We introduce binary matrix factorization, a novel
model for unsupervised matrix decomposition. The decomposition is learned by
fitting a non-parametric Bayesian probabilistic model with binary latent
variables to a matrix of dyadic data. Unlike bi-clustering models, which
assign each row or column to a single cluster based on a categorical hidden
feature, our binary feature model reflects the prior belief that items and
attributes can be associated with more than one latent cluster at a time. We
provide simple learning and inference rules for this new model and show how
to extend it to an infinite model in which the number of features is not a
priori fixed but is allowed to grow with the size of the data.

F. Pérez-Cruz, Zoubin Ghahramani, and M. Pontil.
**Conditional
graphical models**.
In G. H. Bakir, T. Hofmann, B. Schölkopf, A. J. Smola, B. Taskar, and
S. V. N. Vishwanathan, editors, *Predicting Structured Data*, pages
265-282. The MIT Press, Cambridge, MA, USA, September 2007.
Chapter 12.

** Abstract:** In this chapter we propose a
modification of CRF-like algorithms that allows for solving large-scale
structured classification problems. Our approach consists in upper bounding
the CRF functional in order to decompose its training into independent
optimisation problems per clique. Furthermore we show that each sub-problem
corresponds to solving a multiclass learning task in each clique, which
enlarges the applicability of these tools for large-scale structural learning
problems. Before presenting the Conditional Graphical Model (CGM), as we
refer to this procedure, we review the family of CRF algorithms. We
concentrate on the best known procedures and standard generalisations of
CRFs. The objective of this introduction is to analyse from the same
viewpoint the proposed solutions in the literature to tackle this problem,
which allows comparing their different features. We complete the chapter with
a case study, in which we show the possibility to work with large-scale
problems using CGM and that the obtained performance is comparable to the
result with CRF-like algorithms.

Z. Ghahramani, T.L. Griffiths, and P. Sollich.
**Bayesian
nonparametric latent feature models (with discussion)**.
In J.M. Bernardo, M.J. Bayarri, J.O. Berger, A.P. Dawid, D. Heckerman, A.F.M.
Smith, and M. West, editors, *Bayesian Statistics 8*, pages 201-226,
Oxford, UK, July 2007. Oxford University Press.

** Abstract:**
We describe a flexible nonparametric approach to latent variable modelling in
which the number of latent variables is unbounded. This approach is based on
a probability distribution over equivalence classes of binary matrices with a
finite number of rows, corresponding to the data points, and an unbounded
number of columns, corresponding to the latent variables. Each data point can
be associated with a subset of the possible latent variables, which we refer
to as the latent features of that data point. The binary variables in the
matrix indicate which latent feature is possessed by which data point, and
there is a potentially infinite array of features. We derive the distribution
over unbounded binary matrices by taking the limit of a distribution over
N×K binary matrices as K→∞. We define a simple generative
process for this distribution which we call the Indian buffet process (IBP;
Griffiths and Ghahramani, 2005, 2006) by analogy
to the Chinese restaurant process (Aldous, 1985; Pitman, 2002). The IBP has a
single hyperparameter which controls both the number of features per object
and the total number of features. We describe a two-parameter generalization
of the IBP which has additional flexibility, independently controlling the
number of features per object and the total number of features in the matrix.
The use of this distribution as a prior in an infinite latent feature model
is illustrated, and Markov chain Monte Carlo algorithms for inference are
described.

** Comment:** Includes discussion by David Dunson, and
rejoinder.
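As a worked illustration of the hyperparameter's two roles, the following states standard properties of the one-parameter IBP. The notation (α for the hyperparameter, N for the number of data points, K₊ for the number of features possessed by at least one data point) is assumed here and is not taken from the abstract.

```latex
\begin{align}
  \text{number of features of a single data point} &\sim \mathrm{Poisson}(\alpha), \\
  K_{+} &\sim \mathrm{Poisson}(\alpha H_N),
  \qquad H_N = \sum_{j=1}^{N} \frac{1}{j}.
\end{align}
```

A single α therefore fixes both the expected number of features per object (α) and the expected total number of features (α H_N). This is the coupling that the two-parameter generalization relaxes.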

Katherine A. Heller and Zoubin Ghahramani.
**A nonparametric
Bayesian approach to modeling overlapping clusters**.
In Marina Meila and Xiaotong Shen, editors, *AISTATS*, volume 2 of
*JMLR Proceedings*, pages 187-194. JMLR.org, 2007.

**
Abstract:** Although clustering data into mutually exclusive partitions has
been an extremely successful approach to unsupervised learning, there are
many situations in which a richer model is needed to fully represent the
data. This is the case in problems where data points actually simultaneously
belong to multiple, overlapping clusters. For example a particular gene may
have several functions, therefore belonging to several distinct clusters of
genes, and a biologist may want to discover these through unsupervised
modeling of gene expression data. We present a new nonparametric Bayesian
method, the Infinite Overlapping Mixture Model (IOMM), for modeling
overlapping clusters. The IOMM uses exponential family distributions to model
each cluster and forms an overlapping mixture by taking products of such
distributions, much like products of experts (Hinton, 2002). The IOMM allows
an unbounded number of clusters, and the assignment of points to (multiple)
clusters is modeled using an Indian Buffet Process (IBP; Griffiths and
Ghahramani, 2006). The IOMM has the desirable properties of being able to
focus in on overlapping regions while maintaining the ability to model a
potentially infinite number of clusters which may overlap. We derive MCMC
inference algorithms for the IOMM and show that these can be used to cluster
movies into multiple genres.

Edward Snelson and Zoubin Ghahramani.
**Local and global
sparse Gaussian process approximations**.
In M. Meila and X. Shen, editors, *11th International Conference on
Artificial Intelligence and Statistics*. Omnipress, 2007.

**
Abstract:** Gaussian process (GP) models are flexible probabilistic
nonparametric models for regression, classification and other tasks.
Unfortunately they suffer from computational intractability for large data
sets. Over the past decade there have been many different approximations
developed to reduce this cost. Most of these can be termed global
approximations, in that they try to summarize all the training data via a
small set of support points. A different approach is that of local
regression, where many local experts account for their own part of space. In
this paper we start by investigating the regimes in which these different
approaches work well or fail. We then proceed to develop a new sparse GP
approximation which is a combination of both the global and local approaches.
Theoretically we show that it is derived as a natural extension of the
framework developed by Quiñonero-Candela and
Rasmussen for sparse GP approximations. We demonstrate the benefits of
the combined approximation on some 1D examples for illustration, and on some
large real-world data sets.

Ricardo Silva, Katherine A. Heller, and Zoubin Ghahramani.
**Analogical
reasoning with relational Bayesian sets**.
In Marina Meila and Xiaotong Shen, editors, *AISTATS*, volume 2 of
*JMLR Proceedings*, pages 500-507. JMLR.org, 2007.

**
Abstract:** Analogical reasoning depends fundamentally on the ability to
learn and generalize about relations between objects. There are many ways in
which objects can be related, making automated analogical reasoning very
challenging. Here we develop an approach which, given a set of pairs of
related objects S = {A1:B1, A2:B2, ..., AN:BN}, measures how well other pairs
A:B fit in with the set S. This addresses the question: is the relation
between objects A and B analogous to those relations found in S? We recast
this classical problem as a problem of Bayesian analysis of relational
data. This problem is non-trivial because direct similarity between
objects is not a good way of measuring analogies. For instance, the analogy
between an electron around the nucleus of an atom and a planet around the Sun
is hardly justified by isolated, non-relational, comparisons of an electron
to a planet, and a nucleus to the Sun. We develop a generative model for
predicting the existence of relationships and extend the framework of
Ghahramani and Heller (2005) to provide a Bayesian measure for how analogous
a relation is to other relations. This sheds new light on an old problem,
which we motivate and illustrate through practical applications in
exploratory data analysis.

Yee Whye Teh, Dilan Görür, and Zoubin Ghahramani.
**Stick-breaking
construction for the Indian buffet process**.
In Marina Meila and Xiaotong Shen, editors, *AISTATS*, volume 2 of
*JMLR Proceedings*, pages 556-563. JMLR.org, 2007.

**
Abstract:** The Indian buffet process (IBP) is a Bayesian nonparametric
distribution whereby objects are modelled using an unbounded number of latent
features. In this paper we derive a stick-breaking representation for the
IBP. Based on this new representation, we develop slice samplers for the IBP
that are efficient, easy to implement and are more generally applicable than
the currently available Gibbs sampler. This representation, along with the
work of Thibaux and Jordan, also illuminates interesting theoretical
connections between the IBP, Chinese restaurant processes, Beta processes and
Dirichlet processes.
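A minimal sketch of the stick-breaking idea follows. It assumes the standard form of the construction for the one-parameter IBP, in which the feature probabilities in decreasing order are running products of independent Beta(α, 1) draws. The function name and the truncation level `num_sticks` are illustrative choices, and the slice samplers themselves are not shown.

```python
import random


def ibp_stick_lengths(alpha, num_sticks, seed=0):
    """Draw the first `num_sticks` ordered feature probabilities.

    Assumes mu_(k) = nu_1 * nu_2 * ... * nu_k with nu_l ~ Beta(alpha, 1),
    so the sequence is non-increasing by construction.
    """
    rng = random.Random(seed)
    lengths = []
    current = 1.0
    for _ in range(num_sticks):
        current *= rng.betavariate(alpha, 1.0)  # break off a fraction of the stick
        lengths.append(current)
    return lengths
```

Larger values of `alpha` push each Beta(α, 1) draw towards one, so the probabilities decay more slowly and more features remain likely to be active.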

T. L. Griffiths and Z. Ghahramani.
**Infinite latent
feature models and the Indian Buffet Process**.
In Y. Weiss, B. Schölkopf, and J. Platt, editors, *Advances in Neural
Information Processing Systems 18*, pages 475-482, Cambridge, MA, USA,
December 2006. The MIT Press.

** Abstract:** We define a
probability distribution over equivalence classes of binary matrices with a
finite number of rows and an unbounded number of columns. This distribution
is suitable for use as a prior in probabilistic models that represent objects
using a potentially infinite array of features. We identify a simple
generative process that results in the same distribution over equivalence
classes, which we call the Indian buffet process. We illustrate the use of
this distribution as a prior in an infinite latent feature model, deriving a
Markov chain Monte Carlo algorithm for inference in this model and applying
the algorithm to an image dataset.
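The generative process named in this abstract can be sketched as below. This assumes the usual culinary description of the one-parameter IBP: customer i takes each previously sampled dish k with probability m_k / i, where m_k is the number of earlier customers who took it, and then tries Poisson(α / i) new dishes. The function names are illustrative, and the Poisson sampler is a simple stand-in suitable only for small rates.

```python
import math
import random


def sample_poisson(rate, rng):
    """Knuth's multiplication method; adequate for small rates."""
    threshold = math.exp(-rate)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_ibp(num_objects, alpha, seed=0):
    """Return a binary matrix (list of rows) drawn from the IBP."""
    rng = random.Random(seed)
    dish_counts = []  # m_k: how many customers have taken dish k so far
    rows = []         # set of dish indices chosen by each customer
    for i in range(1, num_objects + 1):
        chosen = set()
        # Existing dishes are taken in proportion to their popularity.
        for k, m_k in enumerate(dish_counts):
            if rng.random() < m_k / i:
                chosen.add(k)
        # New dishes: Poisson(alpha / i) of them for the i-th customer.
        for _ in range(sample_poisson(alpha / i, rng)):
            chosen.add(len(dish_counts))
            dish_counts.append(0)
        for k in chosen:
            dish_counts[k] += 1
        rows.append(chosen)
    num_features = len(dish_counts)
    return [[1 if k in row else 0 for k in range(num_features)] for row in rows]
```

The number of columns is not fixed in advance: it grows with the number of objects, and every column returned has at least one nonzero entry.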

Zoubin Ghahramani and Katherine A. Heller.
**Bayesian
sets**.
In Y. Weiss, B. Schölkopf, and J. Platt, editors, *Advances in Neural
Information Processing Systems 18*, pages 435-442, Cambridge, MA, USA,
December 2006. The MIT Press.

** Abstract:** Inspired by
"Google™ Sets", we consider the problem of retrieving items from a
concept or cluster, given a query consisting of a few items from that
cluster. We formulate this as a Bayesian inference problem and describe a
very simple algorithm for solving it. Our algorithm uses a model-based
concept of a cluster and ranks items using a score which evaluates the
marginal probability that each item belongs to a cluster containing the query
items. For exponential family models with conjugate priors this marginal
probability is a simple function of sufficient statistics. We focus on sparse
binary data and show that our score can be evaluated exactly using a single
sparse matrix multiplication, making it possible to apply our algorithm to
very large datasets. We evaluate our algorithm on three datasets: retrieving
movies from EachMovie, finding completions of author sets from the NIPS
dataset, and finding completions of sets of words appearing in the Grolier
encyclopedia. We compare to Google™ Sets and show that Bayesian Sets
gives very reasonable set completions.
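
As a rough illustration of why the score reduces to a single matrix multiplication for binary data, the sketch below assumes independent Bernoulli features with Beta(alpha_j, beta_j) priors. The log ratio log p(x | query) - log p(x) is then linear in x. The function name `bayesian_sets_scores` and the prior scale `c` (hyperparameters set proportional to the feature means) are assumptions made for this example.

```python
import numpy as np

def bayesian_sets_scores(X, query_idx, c=2.0):
    """Log score log p(x | query) - log p(x) for every row x of binary X.

    X         : (n_items, n_features) array of 0/1 values
    query_idx : indices of the rows forming the query set
    c         : assumed prior scale; alpha = c * mean, beta = c * (1 - mean)
    """
    X = np.asarray(X, dtype=float)
    mean = np.clip(X.mean(axis=0), 1e-6, 1 - 1e-6)  # keep hyperparameters positive
    alpha, beta = c * mean, c * (1.0 - mean)

    n_query = len(query_idx)
    ones = X[query_idx].sum(axis=0)        # sufficient statistics of the query
    alpha_post = alpha + ones
    beta_post = beta + n_query - ones

    # Constant term (identical for all items) and per-feature weights q.
    const = np.sum(np.log(alpha + beta) - np.log(alpha + beta + n_query)
                   + np.log(beta_post) - np.log(beta))
    q = np.log(alpha_post) - np.log(alpha) - np.log(beta_post) + np.log(beta)

    # One matrix-vector product scores every item at once.
    return const + X @ q

X_demo = np.array([[1, 1, 0, 0],
                   [1, 1, 0, 0],
                   [1, 1, 1, 0],
                   [0, 0, 1, 1],
                   [0, 0, 0, 1]])
scores = bayesian_sets_scores(X_demo, query_idx=[0, 1])
print(np.argsort(-scores))  # items ordered from most to least related to the query
```

With a sparse `X` (for example a SciPy sparse matrix, with the dense conversion removed), the same product `X @ q` is the sparse multiplication the abstract refers to.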

Arik Azran and Zoubin Ghahramani.
**A new approach to
data driven clustering**.
In William Cohen and Andrew Moore, editors, *23rd International Conference
on Machine Learning*, pages 57-64, Pittsburgh, PA, USA, June 2006.
Omnipress.

** Abstract:** We consider the problem of
clustering in its most basic form where only a local metric on the data space
is given. No parametric statistical model is assumed, and the number of
clusters is learned from the data. We introduce, analyze and demonstrate a
novel approach to clustering where data points are viewed as nodes of a
graph, and pairwise similarities are used to derive a transition probability
matrix P for a Markov random walk between them. The algorithm automatically
reveals structure at increasing scales by varying the number of steps taken
by this random walk. Points are represented as rows of P^{t}, which are the
t-step distributions of the walk starting at that point; these distributions
are then clustered using a KL-minimizing iterative algorithm. Both the number
of clusters, and the number of steps that best reveal it, are found by
optimizing spectral properties of P.
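
The representation step can be sketched as follows, assuming a symmetric non-negative similarity matrix `S` (a hypothetical input chosen for illustration). The sketch only builds the transition matrix and its t-step rows; it does not implement the KL-minimizing clustering or the spectral criterion for choosing the number of clusters and steps.

```python
import numpy as np

def t_step_distributions(S, t):
    """Return the matrix whose rows are the t-step random-walk distributions.

    S : (n, n) array of non-negative pairwise similarities
    t : number of random-walk steps (non-negative integer)
    """
    S = np.asarray(S, dtype=float)
    P = S / S.sum(axis=1, keepdims=True)  # row-normalise into transition probabilities
    return np.linalg.matrix_power(P, t)   # row i = distribution after t steps from point i

# Two tightly connected pairs of points joined by weak links.
S_demo = np.array([[1.00, 0.90, 0.01, 0.01],
                   [0.90, 1.00, 0.01, 0.01],
                   [0.01, 0.01, 1.00, 0.90],
                   [0.01, 0.01, 0.90, 1.00]])
Pt = t_step_distributions(S_demo, t=3)
print(Pt.round(3))
```

Rows belonging to the same tight group become nearly identical after a few steps while rows from different groups stay far apart, which is what makes the rows a useful representation for clustering at a given scale.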

Arik Azran and Zoubin Ghahramani.
**Spectral methods
for automatic multiscale data clustering**.
In *IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*,
pages 190-197, New York, NY, USA, June 2006. IEEE Computer Society, doi
10.1109/CVPR.2006.289.

** Abstract:** Spectral clustering
is a simple yet powerful method for finding structure in data using spectral
properties of an associated pairwise similarity matrix. This paper provides
new insights into how the method works and uses these to derive new
algorithms which, given the data alone, automatically learn different plausible
data partitionings. The main theoretical contribution is a generalization of
a key result in the field, the multicut lemma (Meila 2001). We use this
generalization to derive two algorithms. The first uses the eigenvalues of a
given affinity matrix to infer the number of clusters in data, and the second
combines learning the affinity matrix with inferring the number of clusters.
A hierarchical implementation of the algorithms is also derived. The
algorithms are theoretically motivated and demonstrated on nontrivial data
sets.

Wei Chu, Zoubin Ghahramani, Roland Krause, and David L. Wild.
**Identifying
protein complexes in high-throughput protein interaction screens using an
infinite latent feature model**.
In Russ B. Altman, Tiffany Murray, Teri E. Klein, A. Keith Dunker, and Lawrence
Hunter, editors, *Pacific Symposium on Biocomputing*, pages 231-242.
World Scientific, 2006.

** Abstract:** We propose a Bayesian
approach to identify protein complexes and their constituents from
high-throughput protein-protein interaction screens. An infinite latent
feature model that allows for multi-complex membership by individual proteins
is coupled with a graph diffusion kernel that evaluates the likelihood of two
proteins belonging to the same complex. Gibbs sampling is then used to infer
a catalog of protein complexes from the interaction screen data. An advantage
of this model is that it places no prior constraints on the number of
complexes and automatically infers the number of significant complexes from
the data. Validation results using affinity purification/mass spectrometry
experimental data from yeast RNA-processing complexes indicate that our
method is capable of partitioning the data in a biologically meaningful way.
A supplementary web site containing larger versions of the figures is
available at http://public.kgi.edu/wild/PSBO6/index.html.

Wei Chu, Zoubin Ghahramani, Alexei A. Podtelezhnikov, and David L. Wild.
**Bayesian
segmental models with multiple sequence alignment profiles for protein
secondary structure and contact map prediction**.
*IEEE/ACM Trans. Comput. Biology Bioinform.*, 3(2):98-113, 2006.

** Abstract:** In this paper, we develop a segmental semi-Markov
model (SSMM) for protein secondary structure prediction which incorporates
multiple sequence alignment profiles with the purpose of improving the
predictive performance. The segmental model is a generalization of the hidden
Markov model where a hidden state generates segments of various length and
secondary structure type. A novel parameterized model is proposed for the
likelihood function that explicitly represents multiple sequence alignment
profiles to capture the segmental conformation. Numerical results on
benchmark data sets show that incorporating the profiles results in
substantial improvements and the generalization performance is promising. By
incorporating the information from long range interactions in beta-sheets,
this model is also capable of carrying out inference on contact maps. This is
an important advantage of probabilistic generative models over the
traditional discriminative approach to protein secondary structure
prediction. The Web server of our algorithm and supplementary materials are
available at http://public.kgi.edu/~wild/bsm.html.

Katherine A. Heller and Zoubin Ghahramani.
**A simple Bayesian
framework for content-based image retrieval**.
In *CVPR*, pages 2110-2117. IEEE Computer Society, 2006.

** Abstract:** We present a Bayesian framework for content-based
image retrieval which models the distribution of color and texture features
within sets of related images. Given a user-specified text query (e.g.
"penguins") the system first extracts a set of images, from a labelled
corpus, corresponding to that query. The distribution over features of these
images is used to compute a Bayesian score for each image in a large
unlabelled corpus. Unlabelled images are then ranked using this score and the
top images are returned. Although the Bayesian score is based on computing
marginal likelihoods, which integrate over model parameters, in the case of
sparse binary data the score reduces to a single matrix-vector multiplication
and is therefore extremely efficient to compute. We show that our method
works surprisingly well despite its simplicity and the fact that no relevance
feedback is used. We compare different choices of features, and evaluate our
results using human subjects.

Hyun-Chul Kim and Zoubin Ghahramani.
**Bayesian Gaussian
process classification with the EM-EP algorithm**.
*IEEE Trans. Pattern Anal. Mach. Intell.*, 28(12):1948-1959, 2006.

** Abstract:** Gaussian process classifiers (GPCs) are Bayesian
probabilistic kernel classifiers. In GPCs, the probability of belonging to a
certain class at an input location is monotonically related to the value of
some latent function at that location. Starting from a Gaussian process prior
over this latent function, data are used to infer both the posterior over the
latent function and the values of hyperparameters to determine various
aspects of the function. Recently, the expectation propagation (EP) approach
has been proposed to infer the posterior over the latent function. Based on
this work, we present an approximate EM algorithm, the EM-EP algorithm, to
learn both the latent function and the hyperparameters. This algorithm is
found to converge in practice and provides an efficient Bayesian framework
for learning hyperparameters of the kernel. A multiclass extension of the
EM-EP algorithm for GPCs is also derived. In the experimental results, the
EM-EP algorithms are as good as or better than other methods for GPCs or Support
Vector Machines (SVMs) with cross-validation.

Hyun-Chul Kim, Daijin Kim, Zoubin Ghahramani, and Sung Yang Bang.
**Appearance-based
gender classification with Gaussian processes**.
*Pattern Recognition Letters*, 27(6):618-626, 2006.

** Abstract:** This paper concerns the gender classification task of
discriminating between images of faces of men and women from face images. In
appearance-based approaches, the initial images are preprocessed (e.g.
normalized) and input into classifiers. Recently, support vector machines
(SVMs) which are popular kernel classifiers have been applied to gender
classification and have shown excellent performance. SVMs have difficulty in
determining the hyperparameters in kernels (using cross-validation). We
propose to use Gaussian process classifiers (GPCs) which are Bayesian kernel
classifiers. The main advantage of GPCs over SVMs is that they determine the
hyperparameters of the kernel based on a Bayesian model selection criterion.
The experimental results show that our methods outperformed SVMs with
cross-validation on most of the data sets. Moreover, the kernel hyperparameters
found by GPCs using Bayesian methods can be used to improve SVM
performance.

Hyun-Chul Kim, Daijin Kim, Zoubin Ghahramani, and Sung Yang Bang.
**Gender
classification with Bayesian kernel methods**.
In William W. Cohen and Andrew Moore, editors, *IJCNN*, volume 148 of
*ACM International Conference Proceeding Series*, pages 3371-3376.
Association for Computing Machinery, 2006.

** Abstract:** We
consider the gender classification task of discriminating between images of
faces of men and women from face images. In appearance-based approaches, the
initial images are preprocessed (e.g. normalized) and input into classifiers.
Recently, SVMs which are popular kernel classifiers have been applied to
gender classification and have shown excellent performance. We propose to use
Gaussian process classifiers (GPCs), a Bayesian kernel method,
for gender classification. The main advantage of Bayesian kernel methods such
as GPCs over SVMs is that they determine the hyperparameters of the kernel
based on a Bayesian model selection criterion. Our results show that GPCs
outperformed SVMs with cross-validation.

Iain Murray, Zoubin Ghahramani, and David J. C. MacKay.
**MCMC for
doubly-intractable distributions**.
In *UAI*. AUAI Press, 2006.

** Abstract:** Markov Chain
Monte Carlo (MCMC) algorithms are routinely used to draw samples from
distributions with intractable normalization constants. However, standard
MCMC algorithms do not apply to doubly-intractable distributions in which
there are additional parameter-dependent normalization terms; for example,
the posterior over parameters of an undirected graphical model. An ingenious
auxiliary-variable scheme (Møller et al., 2004) offers a solution: exact
sampling (Propp and Wilson, 1996) is used to sample from a
Metropolis-Hastings proposal for which the acceptance probability is
tractable. Unfortunately the acceptance probability of these expensive
updates can be low. This paper provides a generalization of Møller et al.
(2004) and a new MCMC algorithm, which obtains better acceptance
probabilities for the same amount of exact sampling, and removes the need to
estimate model parameters before sampling begins.

Ricardo Silva and Zoubin Ghahramani.
**Bayesian inference
for Gaussian mixed graph models**.
In *UAI*. AUAI Press, 2006.

** Abstract:** We introduce
priors and algorithms to perform Bayesian inference in Gaussian models
defined by acyclic directed mixed graphs. Such a class of graphs, composed of
directed and bi-directed edges, is a representation of conditional
independencies that is closed under marginalization and arises naturally from
causal models which allow for unmeasured confounding. Monte Carlo methods and
a variational approximation for such models are presented. Our algorithms for
Bayesian inference allow the evaluation of posterior distributions for
several quantities of interest, including causal effects that are not
identifiable from data alone but could otherwise be inferred where
informative prior knowledge about confounding is available.

Edward Snelson and Zoubin Ghahramani.
**Sparse Gaussian
processes using pseudo-inputs**.
In Y. Weiss, B. Schölkopf, and J. Platt, editors, *Advances in Neural
Information Processing Systems 18*, pages 1257-1264. The MIT Press,
Cambridge, MA, 2006.

** Abstract:** We present a new Gaussian
process (GP) regression model whose covariance is parameterized by the
locations of M pseudo-input points, which we learn by a gradient-based
optimization. We take M<<N, where N is the number of real data points,
and hence obtain a sparse regression method which has O(NM^{2})
training cost and O(M^{2}) prediction cost per test case. We also
find hyperparameters of the covariance function in the same joint
optimization. The method can be viewed as a Bayesian regression model with
particular input dependent noise. The method turns out to be closely related
to several other sparse GP approaches, and we discuss the relation in detail.
We finally demonstrate its performance on some large data sets, and make a
direct comparison to other sparse GP methods. We show that our method can
match full GP performance with small M, i.e. very sparse solutions, and it
significantly outperforms other approaches in this regime.

Edward Snelson and Zoubin Ghahramani.
**Variable
noise and dimensionality reduction for sparse Gaussian processes**.
In R. Dechter and T. S. Richardson, editors, *22nd Conference on Uncertainty
in Artificial Intelligence*. AUAI Press, 2006.

** Abstract:** The sparse pseudo-input Gaussian process (SPGP) is a new
approximation method for speeding up GP regression in the case of a large
number of data points N. The approximation is controlled by the gradient
optimization of a small set of M pseudo-inputs, thereby reducing complexity
from O(N^{3}) to O(NM^{2}). One limitation of the SPGP is
that this optimization space becomes impractically big for high dimensional
data sets. This paper addresses this limitation by performing automatic
dimensionality reduction. A projection of the input space to a low
dimensional space is learned in a supervised manner, alongside the
pseudo-inputs, which now live in this reduced space. The paper also
investigates the suitability of the SPGP for modeling data with
input-dependent noise. A further extension of the model is made to make it
even more powerful in this regard: we learn an uncertainty parameter for
each pseudo-input. The combination of sparsity, reduced dimension, and
input-dependent noise makes it possible to apply GPs to much larger and more
complex data sets than was previously practical. We demonstrate the benefits
of these methods on several synthetic and real world problems.

Frank Wood, Thomas L. Griffiths, and Zoubin Ghahramani.
**A non-parametric
Bayesian method for inferring hidden causes**.
In *UAI*. AUAI Press, 2006.

** Abstract:** We present a
non-parametric Bayesian approach to structure learning with hidden causes.
Previous Bayesian treatments of this problem define a prior over the number
of hidden causes and use algorithms such as reversible jump Markov chain
Monte Carlo to move between solutions. In contrast, we assume that the number
of hidden causes is unbounded, but only a finite number influence observable
variables. This makes it possible to use a Gibbs sampler to approximate the
distribution over causal structures. We evaluate the performance of both
approaches in discovering hidden causes in simulated data, and use our
non-parametric approach to discover hidden causes in a real medical
dataset.

Edward Snelson and Zoubin Ghahramani.
**Compact
approximations to Bayesian predictive distributions**.
In *22nd International Conference on Machine Learning*, Bonn, Germany,
August 2005. Omnipress.

** Abstract:** We provide a general
framework for learning precise, compact, and fast representations of the
Bayesian predictive distribution for a model. This framework is based on
minimizing the KL divergence between the true predictive density and a
suitable compact approximation. We consider various methods for doing this,
both sampling based approximations, and deterministic approximations such as
expectation propagation. These methods are tested on a mixture of Gaussians
model for density estimation and on binary linear classification, with both
synthetic data sets for visualization and several real data sets. Our results
show significant reductions in prediction time and memory footprint.

Matthew J. Beal, Francesco Falciani, Zoubin Ghahramani, Claudia Rangel, and
David L. Wild.
**A Bayesian
approach to reconstructing genetic regulatory networks with hidden
factors**.
*Bioinformatics*, 21(3):349-356, 2005.

** Abstract:**
Motivation: We have used state-space models (SSMs) to reverse engineer
transcriptional networks from highly replicated gene expression profiling
time series data obtained from a well-established model of T cell activation.
SSMs are a class of dynamic Bayesian networks in which the observed
measurements depend on some hidden state variables that evolve according to
Markovian dynamics. These hidden variables can capture effects that cannot be
directly measured in a gene expression profiling experiment, for example:
genes that have not been included in the microarray, levels of regulatory
proteins, the effects of mRNA and protein degradation, etc. Results: We have
approached the problem of inferring the model structure of these state-space
models using both classical and Bayesian methods. In our previous work, a
bootstrap procedure was used to derive classical confidence intervals for
parameters representing 'gene-gene' interactions over time. In this article,
variational approximations are used to perform the analogous model selection
task in the Bayesian context. Certain interactions are present in both the
classical and the Bayesian analyses of these regulatory networks. The
resulting models place JunB and JunD at the centre of the mechanisms that
control apoptosis and proliferation. These mechanisms are key for clonal
expansion and for controlling the long term behavior (e.g. programmed cell
death) of these cells.

Wei Chu and Zoubin Ghahramani.
**Gaussian processes
for ordinal regression**.
*Journal of Machine Learning Research*, 6:1019-1041, 2005.

** Abstract:** We present a probabilistic kernel approach to
ordinal regression based on Gaussian processes. A threshold model that
generalizes the probit function is used as the likelihood function for
ordinal variables. Two inference techniques, based on the Laplace
approximation and the expectation propagation algorithm respectively, are
derived for hyperparameter learning and model selection. We compare these two
Gaussian process approaches with a previous ordinal regression method based
on support vector machines on some benchmark and real-world data sets,
including applications of ordinal regression to collaborative filtering and
gene expression analysis. Experimental results on these data sets verify the
usefulness of our approach.

Wei Chu and Zoubin Ghahramani.
**Preference learning
with Gaussian processes**.
In Luc De Raedt and Stefan Wrobel, editors, *ICML*, volume 119 of
*ACM International Conference Proceeding Series*, pages 137-144. ACM,
2005.

** Abstract:** In this paper, we propose a probabilistic
kernel approach to preference learning based on Gaussian processes. A new
likelihood function is proposed to capture the preference relations in the
Bayesian framework. The generalized formulation is also applicable to tackle
many multiclass problems. The overall approach has the advantages of Bayesian
methods for model selection and probabilistic prediction. Experimental
results compared against the constraint classification approach on several
benchmark datasets verify the usefulness of this algorithm.

Wei Chu, Zoubin Ghahramani, Francesco Falciani, and David L. Wild.
**Biomarker
discovery in microarray gene expression data with Gaussian
processes**.
*Bioinformatics*, 21(16):3385-3393, 2005.

** Abstract:**
MOTIVATION: In clinical practice, pathological phenotypes are often labelled
with ordinal scales rather than binary, e.g. the Gleason grading system for
tumour cell differentiation. However, in the literature of microarray
analysis, these ordinal labels have been rarely treated in a principled way.
This paper describes a gene selection algorithm based on Gaussian processes
to discover consistent gene expression patterns associated with ordinal
clinical phenotypes. The technique of automatic relevance determination is
applied to represent the significance level of the genes in a Bayesian
inference framework. RESULTS: The usefulness of the proposed algorithm for
ordinal labels is demonstrated by the gene expression signature associated
with the Gleason score for prostate cancer data. Our results demonstrate how
multi-gene markers that may be initially developed with a diagnostic or
prognostic application in mind are also useful as an investigative tool to
reveal associations between specific molecular and cellular events and
features of tumour physiology. Our algorithm can also be applied to
microarray data with binary labels with results comparable to other methods
in the literature.

Katherine A. Heller and Zoubin Ghahramani.
**Bayesian
hierarchical clustering**.
In Luc De Raedt and Stefan Wrobel, editors, *ICML*, volume 119 of
*ACM International Conference Proceeding Series*, pages 297-304.
Association for Computing Machinery, 2005.

** Abstract:** We
present a novel algorithm for agglomerative hierarchical clustering based on
evaluating marginal likelihoods of a probabilistic model. This algorithm has
several advantages over traditional distance-based agglomerative clustering
algorithms. (1) It defines a probabilistic model of the data which can be
used to compute the predictive distribution of a test point and the
probability of it belonging to any of the existing clusters in the tree. (2)
It uses a model-based criterion to decide on merging clusters rather than an
ad-hoc distance metric. (3) Bayesian hypothesis testing is used to decide
which merges are advantageous and to output the recommended depth of the
tree. (4) The algorithm can be interpreted as a novel fast bottom-up
approximate inference method for a Dirichlet process (i.e. countably
infinite) mixture model (DPM). It provides a new lower bound on the marginal
likelihood of a DPM by summing over exponentially many clusterings of the
data in polynomial time. We describe procedures for learning the model
hyperparameters, computing the predictive distribution, and extensions to
the algorithm. Experimental results on synthetic and real-world data sets
demonstrate useful properties of the algorithm.

Iain Murray, David J. C. MacKay, Zoubin Ghahramani, and John Skilling.
**Nested sampling
for Potts models**.
In *NIPS*, 2005.

** Abstract:** Nested sampling is a new
Monte Carlo method by Skilling intended for general Bayesian computation.
Nested sampling provides a robust alternative to annealing-based methods for
computing normalizing constants. It can also generate estimates of other
quantities such as posterior expectations. The key technical requirement is
an ability to draw samples uniformly from the prior subject to a constraint
on the likelihood. We provide a demonstration with the Potts model, an
undirected graphical model.

JaeMo Sung, Sung Yang Bang, Seungjin Choi, and Zoubin Ghahramani.
**U-likelihood and
u-updating algorithms: Statistical inference in latent variable
models**.
In João Gama, Rui Camacho, Pavel Brazdil, Alípio Jorge, and Luís
Torgo, editors, *ECML*, volume 3720 of *Lecture Notes in Computer
Science*, pages 377-388. Springer, 2005.

** Abstract:**
In this paper we consider latent variable models and introduce a new
U-likelihood concept for estimating the distribution over hidden variables.
One can derive an estimate of parameters from this distribution. Our approach
differs from the Bayesian and Maximum Likelihood (ML) approaches. It gives an
alternative to Bayesian inference when we don't want to define a prior over
parameters and gives an alternative to the ML method when we want a better
estimate of the distribution over hidden variables. As a practical
implementation, we present a U-updating algorithm based on the mean field
theory to approximate the distribution over hidden variables from the
U-likelihood. This algorithm captures some of the correlations among hidden
variables by estimating reaction terms. Those reaction terms are found to
penalize the likelihood. We show that the U-updating algorithm becomes the EM
algorithm as a special case in the large sample limit. The useful behavior of
our method is confirmed for the case of mixture of Gaussians by comparing to
the EM algorithm.

Jian Zhang, Zoubin Ghahramani, and Yiming Yang.
**Learning
multiple related tasks using latent independent component analysis**.
In *NIPS*, 2005.

** Abstract:** We propose a
probabilistic model based on Independent Component Analysis for learning
multiple related tasks. In our model the task parameters are assumed to be
generated from independent sources which account for the relatedness of the
tasks. We use Laplace distributions to model hidden sources which makes it
possible to identify the hidden, independent components instead of just
modeling correlations. Furthermore, our model enjoys a sparsity property
which makes it both parsimonious and robust. We also propose efficient
algorithms for both empirical Bayes method and point estimation. Our
experimental results on two multi-label text classification data sets show
that the proposed approach is promising.

Edward Snelson, Carl Edward Rasmussen, and Zoubin Ghahramani.
**Warped Gaussian
processes**.
In S. Thrun, L. Saul, and B. Schölkopf, editors, *Advances in Neural
Information Processing Systems 16*, pages 337-344, Cambridge, MA, USA,
December 2004. The MIT Press.

** Abstract:** We generalise
the Gaussian process (GP) framework for regression by learning a nonlinear
transformation of the GP outputs. This allows for non-Gaussian processes and
non-Gaussian noise. The learning algorithm chooses a nonlinear transformation
such that transformed data is well-modelled by a GP. This can be seen as
including a preprocessing transformation as an integral part of the
probabilistic modelling problem, rather than as an ad-hoc step. We
demonstrate on several real regression problems that learning the
transformation can lead to significantly better performance than using a
regular GP, or a GP with a fixed transformation.

Philip E. Bourne, C. K. J. Allerston, Werner G. Krebs, Wilfred W. Li, Ilya N.
Shindyalov, Adam Godzik, Iddo Friedberg, Tong Liu, David L. Wild, Seungwoo
Hwang, Zoubin Ghahramani, Li Chen, and John D. Westbrook.
**The status of
structural genomics defined through the analysis of current targets and
structures**.
In Russ B. Altman, A. Keith Dunker, Lawrence Hunter, Tiffany A. Jung, and
Teri E. Klein, editors, *Pacific Symposium on Biocomputing*, pages
375-386. World Scientific, 2004.

** Abstract:** Structural
genomics (large-scale macromolecular 3-dimensional structure determination) is
unique in that major participants report scientific progress on a weekly
basis. The target database (TargetDB) maintained by the Protein Data Bank
(http://targetdb.pdb.org) reports this progress through the status of each
protein sequence (target) under consideration by the major structural
genomics centers worldwide. Hence, TargetDB provides a unique opportunity to
analyze the potential impact that this major initiative provides to
scientists interested in the sequence-structure-function-disease paradigm.
Here we report such an analysis with a focus on (i) temporal
characteristics: how is the project doing and what can we expect in the
future? (ii) target characteristics: what are the predicted functions of the
proteins targeted by structural genomics and how biased is the target set
when compared to the PDB and to predictions across complete genomes? (iii)
structures solved: what are the characteristics of structures solved thus far
and what do they contribute? The analysis required a more extensive database
of structure predictions using different methods integrated with data from
other sources. This database, associated tools and related data sources are
available from http://spam.sdsc.edu.

Wei Chu, Zoubin Ghahramani, and David L. Wild.
**A graphical
model for protein secondary structure prediction**.
In Carla E. Brodley, editor, *ICML*, volume 69 of *ACM
International Conference Proceeding Series*. ACM, 2004.

** Abstract:** In this paper, we present a graphical model for protein
secondary structure prediction. This model extends segmental semi-Markov
models (SSMM) to exploit multiple sequence alignment profiles which contain
information from evolutionarily related sequences. A novel parameterized
model is proposed as the likelihood function for the SSMM to capture the
segmental conformation. By incorporating the information from long range
interactions in β-sheets, this model is capable of carrying out inference on
contact maps. The numerical results on benchmark data sets show that
incorporating the profiles results in substantial improvements and the
generalization performance is promising.

Wei Chu, Zoubin Ghahramani, and David L. Wild.
**Protein
secondary structure prediction using sigmoid belief networks to parameterize
segmental semi-Markov models**.
In *ESANN*, pages 81-86, 2004.

** Abstract:** In this
paper, we merge the parametric structure of neural networks into a segmental
semi-Markov model to set up a Bayesian framework for protein structure
prediction. The parametric model, which can also be regarded as an extension
of a sigmoid belief network, captures the underlying dependency in residue
sequences. The results of numerical experiments indicate the usefulness of
this approach.

A. Dubey, S. Hwang, C. Rangel, Carl Edward Rasmussen, Zoubin Ghahramani, and
David L. Wild.
**Clustering
protein sequence and structure space with infinite Gaussian mixture
models**.
In *Pacific Symposium on Biocomputing 2004*, pages 399-410, Singapore,
2004. World Scientific Publishing.

** Abstract:** We describe
a novel approach to the problem of automatically clustering protein sequences
and discovering protein families, subfamilies etc., based on the theory of
infinite Gaussian mixture models. This method allows the data itself to
dictate how many mixture components are required to model it, and provides a
measure of the probability that two proteins belong to the same cluster. We
illustrate our methods with application to three data sets: globin sequences,
globin sequences with known three-dimensional structures and G-protein coupled
receptor sequences. The consistency of the clusters indicates that our
method is producing biologically meaningful results, which provide a very
good indication of the underlying families and subfamilies. With the
inclusion of secondary structure and residue solvent accessibility
information, we obtain a classification of sequences of known structure which
reflects and extends their SCOP classifications.

Ananya Dubey, Seungwoo Hwang, Claudia Rangel, Carl Edward Rasmussen, Zoubin
Ghahramani, and David L. Wild.
**Clustering
protein sequence and structure space with infinite Gaussian mixture
models**.
In Russ B. Altman, A. Keith Dunker, Lawrence Hunter, Tiffany A. Jung, and
Teri E. Klein, editors, *Pacific Symposium on Biocomputing*, pages
399-410. World Scientific, 2004.

** Abstract:** We describe a
novel approach to the problem of automatically clustering protein sequences
and discovering protein families, subfamilies etc., based on the theory of
infinite Gaussian mixture models. This method allows the data itself to
dictate how many mixture components are required to model it, and provides a
measure of the probability that two proteins belong to the same cluster. We
illustrate our methods with application to three data sets: globin sequences,
globin sequences with known three-dimensional structures and G-protein
coupled receptor sequences. The consistency of the clusters indicates that our
method is producing biologically meaningful results, which provide a very
good indication of the underlying families and subfamilies. With the
inclusion of secondary structure and residue solvent accessibility
information, we obtain a classification of sequences of known structure which
both reflects and extends their SCOP classifications. A supplementary web
site containing larger versions of the figures is available at
http://public.kgi.edu/~wid/PSB04/index.html

Iain Murray and Zoubin Ghahramani.
**Bayesian learning
in undirected graphical models: Approximate MCMC algorithms**.
In David Maxwell Chickering and Joseph Y. Halpern, editors, *UAI*, pages
392-399. AUAI Press, 2004.

** Abstract:** Bayesian learning
in undirected graphical models (computing posterior distributions over
parameters and predictive quantities) is exceptionally difficult. We
conjecture that for general undirected models, there are no tractable MCMC
(Markov Chain Monte Carlo) schemes giving the correct equilibrium
distribution over parameters. While this intractability, due to the partition
function, is familiar to those performing parameter optimisation, Bayesian
learning of posterior distributions over undirected model parameters has been
unexplored and poses novel challenges. We propose several approximate MCMC
schemes and test on fully observed binary models (Boltzmann machines) for a
small coronary heart disease data set and larger artificial systems. While
approximations must perform well on the model, their interaction with the
sampling scheme is also important. Samplers based on variational mean-field
approximations generally performed poorly; more advanced methods using loopy
propagation, brief sampling and stochastic dynamics lead to acceptable
parameter posteriors. Finally, we demonstrate these techniques on a Markov
random field with hidden variables.

Yuan (Alan) Qi, Thomas P. Minka, Rosalind W. Picard, and Zoubin Ghahramani.
**Predictive
automatic relevance determination by expectation propagation**.
In Carla E. Brodley, editor, *ICML*, volume 69 of *ACM
International Conference Proceeding Series*. Association for Computing
Machinery, 2004.

** Abstract:** In many real-world
classification problems the input contains a large number of potentially
irrelevant features. This paper proposes a new Bayesian framework for
determining the relevance of input features. This approach extends one of the
most successful Bayesian methods for feature selection and sparse learning,
known as Automatic Relevance Determination (ARD). ARD finds the relevance of
features by optimizing the model marginal likelihood, also known as the
evidence. We show that this can lead to overfitting. To address this problem,
we propose Predictive ARD based on estimating the predictive performance of
the classifier. While the actual leave-one-out predictive performance is
generally very costly to compute, the expectation propagation (EP) algorithm
proposed by Minka provides an estimate of this predictive performance as a
side-effect of its iterations. We exploit this in our algorithm to do feature
selection, and to select data points in a sparse Bayesian kernel classifier.
Moreover, we provide two other improvements to previous algorithms, by
replacing Laplace's approximation with the generally more accurate EP, and by
incorporating the fast optimization algorithm proposed by Faul and Tipping.
Our experiments show that our method based on the EP estimate of predictive
performance is more accurate on test data than relevance determination by
optimizing the evidence.

Claudia Rangel, John Angus, Zoubin Ghahramani, Maria Lioumi, Elizabeth
Sotheran, Alessia Gaiba, David L. Wild, and Francesco Falciani.
**Modeling T-cell
activation using gene expression profiling and state-space models**.
*Bioinformatics*, 20(9):1361-1372, 2004.

** Abstract:**
Motivation: We have used state-space models to reverse engineer
transcriptional networks from highly replicated gene expression profiling
time series data obtained from a well-established model of T-cell activation.
State space models are a class of dynamic Bayesian networks that assume that
the observed measurements depend on some hidden state variables that evolve
according to Markovian dynamics. These hidden variables can capture effects
that cannot be measured in a gene expression profiling experiment, e.g. genes
that have not been included in the microarray, levels of regulatory proteins,
the effects of messenger RNA and protein degradation, etc. Results: Bootstrap
confidence intervals are developed for parameters representing 'gene-gene'
interactions over time. Our models represent the dynamics of T-cell
activation and provide a methodology for the development of rational and
experimentally testable hypotheses. Availability: Supplementary data and
Matlab computer source code will be made available on the web at the URL
given below. Supplementary information: .

Sebastian Thrun, Yufeng Liu, Daphne Koller, Andrew Y. Ng, Zoubin Ghahramani,
and Hugh F. Durrant-Whyte.
**Simultaneous
localization and mapping with sparse extended information filters**.
*I. J. Robotic Res.*, 23(7-8):693-716, 2004.

**
Abstract:** This paper describes a scalable algorithm for the simultaneous
mapping and localization (SLAM) problem. SLAM is the problem of acquiring a
map of a static environment with a mobile robot. The vast majority of SLAM
algorithms are based on the extended Kalman filter (EKF). This paper
advocates an algorithm that relies on the dual of the EKF, the extended
information filter (EIF). We show that when represented in the information
form, map posteriors are dominated by a small number of links that tie
together nearby features in the map. This insight is developed into a sparse
variant of the EIF, called the sparse extended information filter (SEIF).
SEIFs represent maps by graphical networks of features that are locally
interconnected, where links represent relative information between pairs of
nearby features, as well as information about the robot's pose relative to
the map. We show that all essential update equations in SEIFs can be executed
in constant time, irrespective of the size of the map. We also provide
empirical results obtained for a benchmark data set collected in an outdoor
environment, and using a multi-robot mapping simulation.

Jian Zhang, Zoubin Ghahramani, and Yiming Yang.
**A probabilistic
model for online document clustering with application to novelty
detection**.
In Sebastian Thrun, Lawrence K. Saul, and Bernhard Schölkopf, editors,
*NIPS*. MIT Press, 2004.

** Abstract:** In this paper
we propose a probabilistic model for online document clustering. We use a
non-parametric Dirichlet process prior to model the growing number of
clusters, and use a prior given by a general English language model as the base
distribution to handle the generation of novel clusters. Furthermore, cluster
uncertainty is modeled with a Bayesian Dirichlet-multinomial distribution. We
use an empirical Bayes method to estimate hyperparameters based on a historical
dataset. Our probabilistic model is applied to the novelty detection task in
Topic Detection and Tracking (TDT) and compared with existing approaches in
the literature.

Xiaojin Zhu, Jaz S. Kandola, Zoubin Ghahramani, and John D. Lafferty.
**Nonparametric
transforms of graph kernels for semi-supervised learning**.
In Sebastian Thrun, Lawrence K. Saul, and Bernhard Schölkopf, editors,
*NIPS*. MIT Press, 2004.

** Abstract:** We present an
algorithm based on convex optimization for constructing kernels for
semi-supervised learning. The kernel matrices are derived from the spectral
decomposition of graph Laplacians, and combine labeled and unlabeled data in
a systematic fashion. Unlike previous work using diffusion kernels and
Gaussian random field kernels, a nonparametric kernel approach is presented
that incorporates order constraints during optimization. This results in
flexible kernels and avoids the need to choose among different parametric
forms. Our approach relies on a quadratically constrained quadratic program
(QCQP), and is computationally feasible for large datasets. We evaluate the
kernels on real datasets using support vector machines, with encouraging
results.

Carl Edward Rasmussen and Zoubin Ghahramani.
**Bayesian Monte
Carlo**.
In S. Becker, S. Thrun, and K. Obermayer, editors, *Advances in Neural
Information Processing Systems 15*, pages 489-496, Cambridge, MA, USA,
December 2003. The MIT Press.

** Abstract:** We investigate
Bayesian alternatives to classical Monte Carlo methods for evaluating
integrals. Bayesian Monte Carlo (BMC) allows the incorporation of prior
knowledge, such as smoothness of the integrand, into the estimation. In a
simple problem we show that this outperforms any classical importance
sampling method. We also attempt more challenging multidimensional integrals
involved in computing marginal likelihoods of statistical models (a.k.a.
partition functions and model evidences). We find that Bayesian Monte Carlo
outperformed Annealed Importance Sampling, although for very high dimensional
problems or problems with massive multimodality BMC may be less adequate. One
advantage of the Bayesian approach to Monte Carlo is that samples can be
drawn from any distribution. This allows for the possibility of active design
of sample points so as to maximise information gain.

Zoubin Ghahramani.
**Unsupervised
Learning**.
In Olivier Bousquet, Ulrike von Luxburg, and Gunnar Rätsch, editors,
*Advanced Lectures on Machine Learning*, volume 3176 of *Lecture
Notes in Computer Science*, pages 72-112. Springer, 2003.

** Abstract:** We give a tutorial and overview of the field of
unsupervised learning from the perspective of statistical modelling.
Unsupervised learning can be motivated from information theoretic and
Bayesian principles. We briefly review basic models in unsupervised learning,
including factor analysis, PCA, mixtures of Gaussians, ICA, hidden Markov
models, state-space models, and many variants and extensions. We derive the
EM algorithm and give an overview of fundamental concepts in graphical
models, and inference algorithms on graphs. This is followed by a quick tour
of approximate Bayesian inference, including Markov chain Monte Carlo (MCMC),
Laplace approximation, BIC, variational approximations, and expectation
propagation (EP). The aim of this chapter is to provide a high-level view of
the field. Along the way, many state-of-the-art ideas and future directions
are also reviewed.

Ruslan Salakhutdinov, Sam T. Roweis, and Zoubin Ghahramani.
**On the
convergence of bound optimization algorithms**.
In Christopher Meek and Uffe Kjærulff, editors, *UAI*, pages
509-516. Morgan Kaufmann, 2003.

** Abstract:** Many
practitioners who use EM and related algorithms complain that they are
sometimes slow. When does this happen, and what can be done about it? In this
paper, we study the general class of bound optimization algorithms -
including EM, Iterative Scaling, Non-negative Matrix Factorization, CCCP -
and their relationship to direct optimization algorithms such as
gradient-based methods for parameter learning. We derive a general
relationship between the updates performed by bound optimization methods and
those of gradient and second-order methods and identify analytic conditions
under which bound optimization algorithms exhibit quasi-Newton behavior, and
under which they possess poor, first-order convergence. Based on this
analysis, we consider several specific algorithms, interpret and analyze
their convergence properties and provide some recipes for preprocessing input
to these algorithms to yield faster convergence behavior. We report empirical
results supporting our analysis and showing that simple data preprocessing
can result in dramatically improved performance of bound optimizers in
practice.

Ruslan Salakhutdinov, Sam T. Roweis, and Zoubin Ghahramani.
**Optimization
with EM and expectation-conjugate-gradient**.
In Tom Fawcett and Nina Mishra, editors, *ICML*, pages 672-679. AAAI
Press, 2003.

** Abstract:** We show a close relationship
between the Expectation-Maximization (EM) algorithm and direct optimization
algorithms such as gradient-based methods for parameter learning. We identify
analytic conditions under which EM exhibits Newton-like behavior, and
conditions under which it possesses poor, first-order convergence. Based on
this analysis, we propose two novel algorithms for maximum likelihood
estimation of latent variable models, and report empirical results showing
that, as predicted by theory, the proposed new algorithms can substantially
outperform standard EM in terms of speed of convergence in certain cases.

Xiaojin Zhu, Zoubin Ghahramani, and John D. Lafferty.
**Semi-supervised
learning using Gaussian fields and harmonic functions**.
In Tom Fawcett and Nina Mishra, editors, *ICML*, pages 912-919. AAAI
Press, 2003.

** Abstract:** An approach to semi-supervised
learning is proposed that is based on a Gaussian random field model. Labeled
and unlabeled data are represented as vertices in a weighted graph, with edge
weights encoding the similarity between instances. The learning problem is
then formulated in terms of a Gaussian random field on this graph, where the
mean of the field is characterized in terms of harmonic functions, and is
efficiently obtained using matrix methods or belief propagation. The
resulting learning algorithms have intimate connections with random walks,
electric networks, and spectral graph theory. We discuss methods to
incorporate class priors and the predictions of classifiers obtained by
supervised learning. We also propose a method of parameter learning by
entropy minimization, and show the algorithm's ability to perform feature
selection. Promising experimental results are presented for synthetic data,
digit classification, and text classification tasks.

Matthew J. Beal, Zoubin Ghahramani, and Carl Edward Rasmussen.
**The infinite
hidden Markov model**.
In *Advances in Neural Information Processing Systems 14*, pages
577-584, Cambridge, MA, USA, December 2002. The MIT Press.

**
Abstract:** We show that it is possible to extend hidden Markov models to
have a countably infinite number of hidden states. By using the theory of
Dirichlet processes we can implicitly integrate out the infinitely many
transition parameters, leaving only three hyperparameters which can be
learned from data. These three hyperparameters define a hierarchical
Dirichlet process capable of capturing a rich set of transition dynamics. The
three hyperparameters control the time scale of the dynamics, the sparsity of
the underlying state-transition matrix, and the expected number of distinct
hidden states in a finite sequence. In this framework it is also natural to
allow the alphabet of emitted symbols to be infinite - consider, for
example, symbols being possible words appearing in English text.

Carl Edward Rasmussen and Zoubin Ghahramani.
**Infinite mixtures of
Gaussian process experts**.
In *Advances in Neural Information Processing Systems 14*, pages
881-888, Cambridge, MA, USA, December 2002. The MIT Press.

**
Abstract:** We present an extension to the Mixture of Experts (ME) model,
where the individual experts are Gaussian Process (GP) regression models.
Using an input-dependent adaptation of the Dirichlet Process, we implement a
gating network for an infinite number of Experts. Inference in this model may
be done efficiently using a Markov Chain relying on Gibbs sampling. The model
allows the effective covariance function to vary with the inputs, and may
handle large datasets - thus potentially overcoming two of the biggest
hurdles with GP models. Simulations show the viability of this approach.

Rong Jin and Zoubin Ghahramani.
**Learning with
multiple labels**.
In Suzanna Becker, Sebastian Thrun, and Klaus Obermayer, editors,
*NIPS*, pages 897-904. MIT Press, 2002.

** Abstract:**
In this paper, we study a special kind of learning problem in which each
training instance is given a set of (or distribution over) candidate class
labels and only one of the candidate labels is the correct one. Such a
problem can occur, e.g., in an information retrieval setting where a set of
words is associated with an image, or if class labels are organized
hierarchically. We propose a novel discriminative approach for handling the
ambiguity of class labels in the training examples. The experiments with the
proposed approach over five different UCI datasets show that our approach is
able to find the correct label among the set of candidate labels and actually
achieve performance close to the case when each training instance is given a
single correct label. In contrast, naïve methods degrade rapidly as more
ambiguity is introduced into the labels.

A. Raval, Zoubin Ghahramani, and David L. Wild.
**A Bayesian
network model for protein fold and remote homologue recognition**.
*Bioinformatics*, 18(6):788-801, 2002.

** Abstract:**
Motivation: The Bayesian network approach is a framework which combines
graphical representation and probability theory, which includes, as a special
case, hidden Markov models. Hidden Markov models trained on amino acid
sequence or secondary structure data alone have been shown to have potential
for addressing the problem of protein fold and superfamily classification.
Results: This paper describes a novel implementation of a Bayesian network
which simultaneously learns amino acid sequence, secondary structure and
residue accessibility for proteins of known three-dimensional structure. An
awareness of the errors inherent in predicted secondary structure may be
incorporated into the model by means of a confusion matrix. Training and
validation data have been derived for a number of protein superfamilies from
the Structural Classification of Proteins (SCOP) database. Cross validation
results using posterior probability classification demonstrate that the
Bayesian network performs better in classifying proteins of known structural
superfamily than a hidden Markov model trained on amino acid sequences
alone.

Naonori Ueda and Zoubin Ghahramani.
**Bayesian model
search for mixture models based on optimizing variational bounds**.
*Neural Networks*, 15(10):1223-1241, 2002.

**
Abstract:** When learning a mixture model, we suffer from the local optima
and model structure determination problems. In this paper, we present a
method for simultaneously solving these problems based on the variational
Bayesian (VB) framework. First, in the VB framework, we derive an objective
function that can simultaneously optimize both model parameter distributions
and model structure. Next, focusing on mixture models, we present a
deterministic algorithm to approximately optimize the objective function by
using the idea of the split and merge operations which we previously proposed
within the maximum likelihood framework. Then, we apply the method to mixture
of experts (MoE) models to experimentally show that the proposed method can
find the optimal number of experts of an MoE while avoiding local maxima. ©
2002 Elsevier Science Ltd. All rights reserved.

Carl Edward Rasmussen and Zoubin Ghahramani.
**Occam's
razor**.
In *Advances in Neural Information Processing Systems 13*, pages
294-300, Cambridge, MA, USA, December 2001. The MIT Press.

**
Abstract:** The Bayesian paradigm apparently only sometimes gives rise to
Occam's Razor; at other times very large models perform well. We give simple
examples of both kinds of behaviour. The two views are reconciled when
measuring complexity of functions, rather than of the machinery used to
implement them. We analyze the complexity of functions for some
linear-in-the-parameter models that are equivalent to Gaussian Processes, and always find
Occam's Razor at work.

Zoubin Ghahramani.
**An introduction to
hidden Markov models and Bayesian networks**.
*IJPRAI*, 15(1):9-42, 2001.

** Abstract:** We provide a
tutorial on learning and inference in hidden Markov models in the context of
the recent literature on Bayesian networks. This perspective makes it
possible to consider novel generalizations to hidden Markov models with
multiple hidden state variables, multiscale representations, and mixed
discrete and continuous variables. Although exact inference in these
generalizations is usually intractable, one can use approximate
inference algorithms such as Markov chain sampling and variational methods.
We describe how such methods are applied to these generalized hidden Markov
models. We conclude this review with a discussion of Bayesian methods for
model selection in generalized HMMs.

Nicholas J. Adams, Amos J. Storkey, Christopher K. I. Williams, and Zoubin
Ghahramani.
**MFDTs: Mean
field dynamic trees**.
In *International Conference on Pattern Recognition*, volume 3,
pages 151-154, 2000.

** Abstract:** Tree structured belief
networks are attractive for image segmentation tasks. However, networks with
fixed architectures are not very suitable as they lead to blocky artefacts,
which led to the introduction of dynamic trees (DTs). The dynamic tree
architecture provides a prior distribution over tree structures, and simulated
annealing (SA) was used to search for structures with high posterior
probability. In this paper we introduce a mean field approach to inference in
DTs. We find that the mean field method captures the posterior better than
just using the maximum a posteriori solution found by SA.

Zoubin Ghahramani and Matthew J. Beal.
**Propagation
algorithms for variational Bayesian learning**.
In Michael J. Kearns, Sara A. Solla, and David A. Cohn, editors, *NIPS*,
pages 507-513. The MIT Press, 2000.

** Abstract:**
Variational approximations are becoming a widespread tool for Bayesian
learning of graphical models. We provide some theoretical results for the
variational updates in a very general family of conjugate-exponential
graphical models. We show how the belief propagation and the junction tree
algorithms can be used in the inference step of variational Bayesian
learning. Applying these results to the Bayesian analysis of linear-Gaussian
state-space models we obtain a learning procedure that exploits the Kalman
smoothing propagation, while integrating over all model parameters. We
demonstrate how this can be used to infer the hidden state dimensionality of
the state-space model in a variety of synthetic problems and one real
high-dimensional data set.

Zoubin Ghahramani and Geoffrey E. Hinton.
**Variational
learning for switching state-space models**.
*Neural Computation*, 12(4):831-864, 2000.

**
Abstract:** We introduce a new statistical model for time series which
iteratively segments data into regimes with approximately linear dynamics and
learns the parameters of each of these linear regimes. This model combines
and generalizes two of the most widely used stochastic time series
models (hidden Markov models and linear dynamical systems) and is closely
related to models that are widely used in the control and econometrics
literatures. It can also be derived by extending the mixture of experts
neural network (Jacobs et al., 1991) to its fully dynamical version, in which
both expert and gating networks are recurrent. Inferring the posterior
probabilities of the hidden states of this model is computationally
intractable, and therefore the exact Expectation Maximization (EM) algorithm
cannot be applied. However, we present a variational approximation that
maximizes a lower bound on the log likelihood and makes use of both the
forward-backward recursions for hidden Markov models and the Kalman filter
recursions for linear dynamical systems. We tested the algorithm both on
artificial data sets and on a natural data set of respiration force from a
patient with sleep apnea. The results suggest that variational approximations
are a viable method for inference and learning in switching state-space
models.

Naonori Ueda, Ryohei Nakano, Zoubin Ghahramani, and Geoffrey E. Hinton.
**SMEM algorithm
for mixture models**.
*Neural Computation*, 12(9):2109-2128, 2000.

**
Abstract:** We present a split-and-merge expectation-maximization (SMEM)
algorithm to overcome the local maxima problem in parameter estimation of
finite mixture models. In the case of mixture models, local maxima often
involve having too many components of a mixture model in one part of the
space and too few in another, widely separated part of the space. To escape
from such configurations, we repeatedly perform simultaneous split-and-merge
operations using a new criterion for efficiently selecting the
split-and-merge candidates. We apply the proposed algorithm to the training
of gaussian mixtures and mixtures of factor analyzers using synthetic and
real data and show the effectiveness of using the split-and-merge operations
to improve the likelihood of both the training data and of held-out test
data. We also show the practical usefulness of the proposed algorithm by
applying it to image compression and pattern recognition problems.

Naonori Ueda, Ryohei Nakano, Zoubin Ghahramani, and Geoffrey E. Hinton.
**Split and merge
EM algorithm for improving Gaussian mixture density estimates**.
*VLSI Signal Processing*, 26(1-2):133-140, 2000.

**
Abstract:** We present a split and merge EM algorithm to overcome the local
maximum problem in Gaussian mixture density estimation. Nonglobal maxima
often involve having too many Gaussians in one part of the space and too few
in another, widely separated part of the space. To escape from such
configurations we repeatedly perform split and merge operations using a new
criterion for efficiently selecting the split and merge candidates.
Experimental results on synthetic and real data show the effectiveness of
using the split and merge operations to improve the likelihood of both the
training data and of held-out test data.

Zoubin Ghahramani and Matthew J. Beal.
**Variational
inference for Bayesian mixtures of factor analysers**.
In Michael J. Kearns, Sara A. Solla, and David A. Cohn, editors, *NIPS*,
pages 449-455. The MIT Press, 1999.

** Abstract:** We present
an algorithm that infers the model structure of a mixture of factor analysers
using an efficient and deterministic variational approximation to full
Bayesian integration over model parameters. This procedure can automatically
determine the optimal number of components and the local dimensionality of
each component (i.e. the number of factors in each factor analyser).
Alternatively it can be used to infer posterior distributions over number of
components and dimensionalities. Since all parameters are integrated out the
method is not prone to overfitting. Using a stochastic procedure for adding
components it is possible to perform the variational optimisation
incrementally and to avoid local maxima. Results show that the method works
very well in practice and correctly infers the number and dimensionality of
nontrivial synthetic examples. By importance sampling from the variational
approximation we show how to obtain unbiased estimates of the true evidence,
the exact predictive density, and the KL divergence between the variational
posterior and the true posterior, not only in this model but for variational
approximations in general.

Zoubin Ghahramani, Alexander T Korenberg, and Geoffrey E Hinton.
**Scaling in a
hierarchical unsupervised network**.
In *Artificial Neural Networks, 1999. ICANN 99. Ninth International
Conference on (Conf. Publ. No. 470)*, volume 1, pages 13-18. IET,
1999.

** Abstract:** A persistent worry with computational
models of unsupervised learning is that learning will become more difficult
as the problem is scaled. We examine this issue in the context of a novel
hierarchical, generative model that can be viewed as a non-linear
generalization of factor analysis and can be implemented in a neural network.
The model performs perceptual inference in a probabilistically consistent
manner by using top-down, bottom-up and lateral connections. These
connections can be learned using simple rules that require only locally
available information. We first demonstrate that the model can extract a
sparse, distributed, hierarchical representation of global disparity from
simplified random-dot stereograms. We then investigate some of the scaling
properties of the algorithm on this problem and find that: (1) increasing
the image size leads to faster and more reliable learning; (2) increasing the
depth of the network from one to two hidden layers leads to better
representations at the first hidden layer; and (3) once one part of the
network has discovered how to represent disparity, it 'supervises' other
parts of the network, greatly speeding up their learning.

Geoffrey E. Hinton, Zoubin Ghahramani, and Yee Whye Teh.
**Learning to
parse images**.
In Michael J. Kearns, Sara A. Solla, and David A. Cohn, editors, *NIPS*,
pages 463-469. The MIT Press, 1999.

** Abstract:** We
describe a class of probabilistic models that we call credibility networks.
Using parse trees as internal representations of images, credibility networks
are able to perform segmentation and recognition simultaneously, removing the
need for ad hoc segmentation heuristics. Promising results in the problem of
segmenting handwritten digits were obtained.

Michael I. Jordan, Zoubin Ghahramani, Tommi Jaakkola, and Lawrence K. Saul.
**An introduction
to variational methods for graphical models**.
*Machine Learning*, 37(2):183-233, 1999.

** Abstract:**
This paper presents a tutorial introduction to the use of variational methods
for inference and learning in graphical models (Bayesian networks and Markov
random fields). We present a number of examples of graphical models,
including the QMR-DT database, the sigmoid belief network, the Boltzmann
machine, and several variants of hidden Markov models, in which it is
infeasible to run exact inference algorithms. We then introduce variational
methods, which exploit laws of large numbers to transform the original
graphical model into a simplified graphical model in which inference is
efficient. Inference in the simplified model provides bounds on probabilities
of interest in the original model. We describe a general framework for
generating variational transformations based on convex duality. Finally we
return to the examples and demonstrate how variational algorithms can be
formulated in each case.

Sam T. Roweis and Zoubin Ghahramani.
**A unifying review
of linear Gaussian models**.
*Neural Computation*, 11(2):305-345, 1999.

**
Abstract:** Factor analysis, principal component analysis, mixtures of
gaussian clusters, vector quantization, Kalman filter models, and hidden
Markov models can all be unified as variations of unsupervised learning under
a single basic generative model. This is achieved by collecting together
disparate observations and derivations made by many previous authors and
introducing a new way of linking discrete and continuous state models using a
simple nonlinearity. Through the use of other nonlinearities, we show how
independent component analysis is also a variation of the same basic
generative model. We show that factor analysis and mixtures of gaussians can
be implemented in autoencoder neural networks and learned using squared error
plus the same regularization term. We introduce a new model for static data,
known as sensible principal component analysis, as well as a novel concept of
spatially adaptive observation noise. We also review some of the literature
involving global and local mixtures of the basic models and provide
pseudocode for inference and learning for all the basic models.

Zoubin Ghahramani and Sam T. Roweis.
**Learning nonlinear
dynamical systems using an EM algorithm**.
In Michael J. Kearns, Sara A. Solla, and David A. Cohn, editors, *NIPS*,
pages 431-437. The MIT Press, 1998.

** Abstract:** The
Expectation Maximization (EM) algorithm is an iterative procedure for maximum
likelihood parameter estimation from data sets with missing or hidden
variables. It has been applied to system identification in linear stochastic
state-space models, where the state variables are hidden from the observer
and both the state and the parameters of the model have to be estimated
simultaneously [9]. We present a generalization of the EM algorithm for
parameter estimation in nonlinear dynamical systems. The "expectation" step
makes use of Extended Kalman Smoothing to estimate the state, while the
"maximization" step re-estimates the parameters using these uncertain state
estimates. In general, the nonlinear maximization step is difficult because
it requires integrating out the uncertainty in the states. However, if
Gaussian radial basis function (RBF) approximators are used to model the
nonlinearities, the integrals become tractable and the maximization step can
be solved via systems of linear equations.

Naonori Ueda, Ryohei Nakano, Zoubin Ghahramani, and Geoffrey E. Hinton.
**SMEM algorithm
for mixture models**.
In Michael J. Kearns, Sara A. Solla, and David A. Cohn, editors, *NIPS*,
pages 599-605. The MIT Press, 1998.

** Abstract:** We present
a split-and-merge expectation-maximization (SMEM) algorithm to overcome the
local maxima problem in parameter estimation of finite mixture models. In the
case of mixture models, local maxima often involve having too many components
of a mixture model in one part of the space and too few in another, widely
separated part of the space. To escape from such configurations, we
repeatedly perform simultaneous split-and-merge operations using a new
criterion for efficiently selecting the split-and-merge candidates. We apply
the proposed algorithm to the training of gaussian mixtures and mixtures of
factor analyzers using synthetic and real data and show the effectiveness of
using the split-and-merge operations to improve the likelihood of both the
training data and of held-out test data. We also show the practical
usefulness of the proposed algorithm by applying it to image compression and
pattern recognition problems.
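
As a point of reference, the plain EM iteration that a split-and-merge scheme would wrap can be sketched for a one-dimensional Gaussian mixture. This is a minimal illustration, not the authors' SMEM code: the split and merge moves and their selection criterion are not shown, and the function name `em_gmm_1d`, the quantile-based initialisation, and the small variance floor are all assumptions made for the example.

```python
import numpy as np

def em_gmm_1d(x, k, n_iter=50):
    """Plain EM for a 1-D Gaussian mixture (illustrative sketch only)."""
    n = x.shape[0]
    # Assumed initialisation: means at evenly spaced quantiles of the data,
    # a shared variance, and uniform mixing weights.
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)
    var = np.full(k, np.var(x))
    pi = np.full(k, 1.0 / k)
    trace = []
    for _ in range(n_iter):
        # E-step: responsibilities r[i, j] = p(component j | x_i).
        log_p = (np.log(pi) - 0.5 * np.log(2 * np.pi * var)
                 - 0.5 * (x[:, None] - mu) ** 2 / var)
        log_norm = np.logaddexp.reduce(log_p, axis=1)
        r = np.exp(log_p - log_norm[:, None])
        trace.append(log_norm.sum())  # log-likelihood before this update
        # M-step: responsibility-weighted maximum-likelihood updates.
        nk = r.sum(axis=0)
        pi = nk / n
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6  # floor
    return pi, mu, var, np.array(trace)

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-3.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])
pi, mu, var, trace = em_gmm_1d(x, k=2)
print(np.sort(mu))
```

Each iteration cannot decrease the training log-likelihood beyond numerical error, which is why EM can settle in the poor local maxima the abstract describes when components are badly allocated at the start.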

Zoubin Ghahramani and Geoffrey E. Hinton.
**Hierarchical
non-linear factor analysis and topographic maps**.
In Michael I. Jordan, Michael J. Kearns, and Sara A. Solla, editors,
*NIPS*. The MIT Press, 1997.

** Abstract:** We first
describe a hierarchical, generative model that can be viewed as a non-linear
generalisation of factor analysis and can be implemented in a neural network.
The model performs perceptual inference in a probabilistically consistent
manner by using top-down, bottom-up and lateral connections. These
connections can be learned using simple rules that require only locally
available information. We then show how to incorporate lateral connections
into the generative model. The model extracts a sparse, distributed,
hierarchical representation of depth from simplified random-dot stereograms
and the localised disparity detectors in the first hidden layer form a
topographic map. When presented with image patches from natural scenes, the
model develops topographically organised local feature detectors.

Zoubin Ghahramani.
**Learning dynamic
Bayesian networks**.
In C. Lee Giles and Marco Gori, editors, *Adaptive Processing of Sequences
and Data Structures*, volume 1387 of *Lecture Notes in Computer
Science*, pages 168-197. Springer, 1997.

** Abstract:**
Bayesian networks are a concise graphical formalism for describing
probabilistic models. We have provided a brief tutorial of methods for
learning and inference in dynamic Bayesian networks. In many of the
interesting models, beyond the simple linear dynamical system or hidden
Markov model, the calculations required for inference are intractable. Two
different approaches for handling this intractability are Monte Carlo methods
such as Gibbs sampling, and variational methods. An especially promising
variational approach is based on exploiting tractable substructures in the
Bayesian network.

Zoubin Ghahramani and Michael I. Jordan.
**Factorial hidden
Markov models**.
*Machine Learning*, 29(2-3):245-273, 1997.

** Abstract:** Hidden Markov models (HMMs) have proven to be one of the most
widely used tools for learning probabilistic models of time series data. In
an HMM, information about the past is conveyed through a single discrete
variable, the hidden state. We discuss a generalization of HMMs in which
this state is factored into multiple state variables and is therefore
represented in a distributed manner. We describe an exact algorithm for
inferring the posterior probabilities of the hidden state variables given the
observations, and relate it to the forward-backward algorithm for HMMs and
to algorithms for more general graphical models. Due to the combinatorial
nature of the hidden state representation, this exact algorithm is
intractable. As in other intractable systems, approximate inference can be
carried out using Gibbs sampling or variational methods. Within the
variational framework, we present a structured approximation in which the
state variables are decoupled, yielding a tractable algorithm for learning
the parameters of the model. Empirical comparisons suggest that these
approximations are efficient and provide accurate alternatives to the exact
methods. Finally, we use the structured approximation to model Bach's
chorales and show that factorial HMMs can capture statistical structure in
this data set which an unconstrained HMM cannot.
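
The combinatorial growth of the hidden state can be made concrete with a small sketch. Assuming the state variables evolve as independent Markov chains a priori (an assumption of this example, along with the sizes `M` and `K` and the helper names), the single HMM equivalent to the factored model has a transition matrix given by the Kronecker product of the per-chain matrices.

```python
from functools import reduce

import numpy as np

def joint_transition(per_chain):
    """Transition matrix of the single HMM equivalent to independent chains."""
    # For chains that evolve independently, the joint transition matrix
    # is the Kronecker product of the per-chain transition matrices.
    return reduce(np.kron, per_chain)

def random_stochastic(k, rng):
    """A random k x k matrix whose rows sum to one."""
    p = rng.random((k, k))
    return p / p.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
M, K = 3, 4  # hypothetical sizes: 3 chains with 4 states each
chains = [random_stochastic(K, rng) for _ in range(M)]
P = joint_transition(chains)
print(P.shape)  # (64, 64), i.e. K**M joint states
```

The joint state space has `K**M` elements, so algorithms that enumerate it scale exponentially in the number of chains, which is the motivation for the approximate schemes the abstract mentions.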

Geoffrey E Hinton and Zoubin Ghahramani.
**Generative models
for discovering sparse distributed representations**.
*Philosophical Transactions of the Royal Society of London. Series B:
Biological Sciences*, 352(1358):1177-1190, 1997.

** Abstract:** We describe a hierarchical, generative model that can be viewed
as a nonlinear generalization of factor analysis and can be implemented in a
neural network. The model uses bottom–up, top–down and lateral
connections to perform Bayesian perceptual inference correctly. Once
perceptual inference has been performed the connection strengths can be
updated using a very simple learning rule that only requires locally
available information. We demonstrate that the network learns to extract
sparse, distributed, hierarchical representations.

David A. Cohn, Zoubin Ghahramani, and Michael I. Jordan.
**Active learning
with statistical models**.
*J. Artif. Intell. Res. (JAIR)*, 4:129-145, 1996.

** Abstract:** For many types of machine learning algorithms, one can compute
the statistically "optimal" way to select training data. In this paper, we
review how optimal data selection techniques have been used with feedforward
neural networks. We then show how the same principles may be used to select
data for two alternative, statistically-based learning architectures:
mixtures of Gaussians and locally weighted regression. While the techniques
for neural networks are computationally expensive and approximate, the
techniques for mixtures of Gaussians and locally weighted regression are both
efficient and accurate. Empirically, we observe that the optimality criterion
sharply decreases the number of training examples the learner needs in order
to achieve good performance.

Michael I. Jordan, Zoubin Ghahramani, and Lawrence K. Saul.
**Hidden Markov
decision trees**.
In Michael Mozer, Michael I. Jordan, and Thomas Petsche, editors,
*NIPS*, pages 501-507. MIT Press, 1996.

** Abstract:**
We study a time series model that can be viewed as a decision tree with
Markov temporal structure. The model is intractable for exact calculations,
thus we utilize variational approximations. We consider three different
distributions for the approximation: one in which the Markov calculations are
performed exactly and the layers of the decision tree are decoupled, one in
which the decision tree calculations are performed exactly and the time steps
of the Markov chain are decoupled, and one in which a Viterbi-like assumption
is made to pick out a single most likely state sequence. We present
simulation results for artificial data and the Bach chorales.

Carl Edward Rasmussen, Radford M. Neal, Geoffrey E. Hinton, Drew van Camp, Mike
Revow, Zoubin Ghahramani, Rafal Kustra, and Robert Tibshirani.
**The DELVE
manual**, 1996.

** Abstract:** DELVE - Data for
Evaluating Learning in Valid Experiments - is a collection of datasets from
many sources, an environment within which this data can be used to assess the
performance of methods for learning relationships from data, and a repository
for the results of such experiments.

** Comment:** The delve website.

Zoubin Ghahramani and Michael I. Jordan.
**Factorial hidden
Markov models**.
In David S. Touretzky, Michael Mozer, and Michael E. Hasselmo, editors,
*NIPS*, pages 472-478. MIT Press, 1995.

** Abstract:**
Hidden Markov models (HMMs) have proven to be one of the most widely used
tools for learning probabilistic models of time series data. In an HMM,
information about the past is conveyed through a single discrete
variable, the hidden state. We discuss a generalization of HMMs in which
this state is factored into multiple state variables and is therefore
represented in a distributed manner. We describe an exact algorithm for
inferring the posterior probabilities of the hidden state variables given the
observations, and relate it to the forward-backward algorithm for HMMs and
to algorithms for more general graphical models. Due to the combinatorial
nature of the hidden state representation, this exact algorithm is
intractable. As in other intractable systems, approximate inference can be
carried out using Gibbs sampling or variational methods. Within the
variational framework, we present a structured approximation in which the
state variables are decoupled, yielding a tractable algorithm for learning
the parameters of the model. Empirical comparisons suggest that these
approximations are efficient and provide accurate alternatives to the exact
methods. Finally, we use the structured approximation to model Bach's
chorales and show that factorial HMMs can capture statistical structure in
this data set which an unconstrained HMM cannot.

David A. Cohn, Zoubin Ghahramani, and Michael I. Jordan.
**Active learning
with statistical models**.
In Gerald Tesauro, David S. Touretzky, and Todd K. Leen, editors, *Advances
in Neural Information Processing Systems 7*, pages 705-712. MIT Press,
1994.

** Abstract:** For many types of machine learning
algorithms, one can compute the statistically "optimal" way to select
training data. In this paper, we review how optimal data selection techniques
have been used with feedforward neural networks. We then show how the same
principles may be used to select data for two alternative,
statistically-based learning architectures: mixtures of Gaussians and locally
weighted regression. While the techniques for neural networks are
computationally expensive and approximate, the techniques for mixtures of
Gaussians and locally weighted regression are both efficient and accurate.
Empirically, we observe that the optimality criterion sharply decreases the
number of training examples the learner needs in order to achieve good
performance.

Zoubin Ghahramani.
**Factorial learning and
the EM algorithm**.
In Gerald Tesauro, David S. Touretzky, and Todd K. Leen, editors, *Advances
in Neural Information Processing Systems 7*, pages 617-624. MIT Press,
1994.

** Abstract:** Many real world learning problems are
best characterized by an interaction of multiple independent causes or
factors. Discovering such causal structure from the data is the focus of this
paper. Based on Zemel and Hinton's cooperative vector quantizer (CVQ)
architecture, an unsupervised learning algorithm is derived from the
Expectation-Maximization (EM) framework. Due to the combinatorial nature of
the data generation process, the exact E-step is computationally intractable.
Two alternative methods for computing the E-step are proposed: Gibbs sampling
and mean-field approximation, and some promising empirical results are
presented.

Zoubin Ghahramani, Daniel M. Wolpert, and Michael I. Jordan.
**Computational
structure of coordinate transformations: A generalization study**.
In Gerald Tesauro, David S. Touretzky, and Todd K. Leen, editors, *Advances
in Neural Information Processing Systems 7*, pages 1125-1132. MIT Press,
1994.

** Abstract:** One of the fundamental properties that
both neural networks and the central nervous system share is the ability to
learn and generalize from examples. While this property has been studied
extensively in the neural network literature it has not been thoroughly
explored in human perceptual and motor learning. We have chosen a coordinate
transformation system, the visuomotor map which transforms visual
coordinates into motor coordinates, to study the generalization effects of
learning new input-output pairs. Using a paradigm of computer controlled
altered visual feedback, we have studied the generalization of the visuomotor
map subsequent to both local and context-dependent remappings. A local
remapping of one or two input-output pairs induced a significant global, yet
decaying, change in the visuomotor map, suggesting a representation for the
map composed of units with large functional receptive fields. Our study of
context-dependent remappings indicated that a single point in visual space
can be mapped to two different finger locations depending on a context
variable, the starting point of the movement. Furthermore, as the context is
varied there is a gradual shift between the two remappings, consistent with
two visuomotor modules being learned and gated smoothly with the context.

Daniel M. Wolpert, Zoubin Ghahramani, and Michael I. Jordan.
**Forward dynamic
models in human motor control: Psychophysical evidence**.
In Gerald Tesauro, David S. Touretzky, and Todd K. Leen, editors, *Advances
in Neural Information Processing Systems 7*, pages 43-50. MIT Press,
1994.

** Abstract:** Based on computational principles, with
as yet no direct experimental validation, it has been proposed that the
central nervous system (CNS) uses an internal model to simulate the dynamic
behavior of the motor system in planning, control and learning. We present
experimental results and simulations based on a novel approach that
investigates the temporal propagation of errors in the sensorimotor
integration process. Our results provide direct support for the existence of
an internal model.

Zoubin Ghahramani and Michael I. Jordan.
**Supervised learning
from incomplete data via an EM approach**.
In Jack D. Cowan, Gerald Tesauro, and Joshua Alspector, editors, *NIPS*,
pages 120-127. Morgan Kaufmann, 1993.

** Abstract:**
Real-world learning tasks may involve high-dimensional data sets with
arbitrary patterns of missing data. In this paper we present a framework
based on maximum likelihood density estimation for learning from such data
sets. We use mixture models for the density estimates and make two distinct
appeals to the Expectation-Maximization (EM) principle (Dempster et al., 1977)
in deriving a learning algorithm: EM is used both for the estimation of
mixture components and for coping with missing data. The resulting algorithm
is applicable to a wide range of supervised as well as unsupervised learning
problems. Results from a classification benchmark, the iris data set, are
presented.

Adrian Goldwaser and Hong Ge.
**Learning deep neural networks
through iterative linearisation**.
In *NeurIPS 2022 Workshop Optimisation in Machine Learning*, 2022.

** Abstract:** The excellent real-world performance of deep
neural networks has received increasing attention. Despite the capacity to
overfit significantly, such large models work better than smaller ones. This
phenomenon is often referred to as the scaling law by practitioners. It is of
fundamental interest to study why the scaling law exists and how it
avoids/controls overfitting. One approach has been looking at infinite width
limits of neural networks (e.g., Neural Tangent Kernels, Gaussian Processes);
however, in practice, these do not fully explain finite networks as their
infinite counterparts do not learn features. Furthermore, the empirical
kernel for finite networks (i.e., the inner product of feature vectors),
changes significantly during training in contrast to infinite width networks.
In this work we derive an iterative linearised training method. We justify
iterative linearisation as an interpolation between finite analogs of the
infinite width regime, which do not learn features, and standard gradient
descent training which does. We show some preliminary results where iterative
linearised training works well, noting in particular how much feature
learning is required to achieve comparable performance. We also provide novel
insights into the training behaviour of neural networks.

Wessel P. Bruinsma, James Requeima, Andrew Y. K. Foong, Jonathan Gordon, and
Richard E. Turner.
**The Gaussian neural
process**.
In *3rd Symposium on Advances in Approximate Bayesian Inference*,
2021.

** Abstract:** Neural Processes (NPs; Garnelo et al.,
2018a,b) are a rich class of models for meta-learning that map data sets
directly to predictive stochastic processes. We provide a rigorous analysis
of the standard maximum-likelihood objective used to train conditional NPs.
Moreover, we propose a new member to the Neural Process family called the
Gaussian Neural Process (GNP), which models predictive correlations,
incorporates translation equivariance, provides universal approximation
guarantees, and demonstrates encouraging performance.

Jonathan Gordon, Wessel Bruinsma, Andrew Y. K. Foong, James Requeima, Yann
Dubois, and Richard Turner.
**Convolutional
conditional neural processes**.
In *8th International Conference on Learning Representations*, Addis
Ababa, April 2020.

** Abstract:** We introduce the
Convolutional Conditional Neural Process (ConvCNP), a new member of the
Neural Process family that models translation equivariance in the data.
Translation equivariance is an important inductive bias for many learning
problems including time series modelling, spatial data, and images. The model
embeds data sets into an infinite-dimensional function space, as opposed to
finite-dimensional vector spaces. To formalize this notion, we extend the
theory of neural representations of sets to include functional
representations, and demonstrate that any translation-equivariant embedding
can be represented using a convolutional deep-set. We evaluate ConvCNPs in
several settings, demonstrating that they achieve state-of-the-art
performance compared to existing NPs. We demonstrate that building in
translation equivariance enables zero-shot generalization to challenging,
out-of-domain tasks.
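
The equivariance property can be illustrated with a toy functional embedding. The sketch below is not the ConvCNP architecture: it is a kernel-smoothed map from a context set to two functions on a grid, with the function name, the RBF kernel, the length scale, and the test data all chosen for illustration. Because the map depends only on differences between grid points and context inputs, shifting the inputs shifts the embedding by the same amount.

```python
import numpy as np

def functional_embedding(x_ctx, y_ctx, x_grid, length_scale=0.5):
    """Map a context set to two functions evaluated on x_grid."""
    # RBF weights between every grid point and every context input.
    d2 = (x_grid[:, None] - x_ctx[None, :]) ** 2
    w = np.exp(-0.5 * d2 / length_scale ** 2)
    density = w.sum(axis=1)  # where observations are located
    signal = w @ y_ctx       # what was observed there
    return np.stack([density, signal], axis=1)

rng = np.random.default_rng(0)
x_ctx = rng.uniform(-2, 2, size=10)
y_ctx = np.sin(x_ctx)
x_grid = np.linspace(-3, 3, 61)
tau = 1.25  # arbitrary shift

base = functional_embedding(x_ctx, y_ctx, x_grid)
shifted = functional_embedding(x_ctx + tau, y_ctx, x_grid + tau)
print(np.allclose(base, shifted))  # True: shifting inputs shifts the embedding
```

A fixed-length vector summary of the context set would not in general carry this property, which is one way to read the abstract's contrast between function-space and vector-space embeddings.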

John Bronskill, Jonathan Gordon, James Requeima, Sebastian Nowozin, and
Richard E. Turner.
**TaskNorm:
rethinking batch normalization for meta-learning**.
In *37th International Conference on Machine Learning*. Proceedings of
Machine Learning Research, 2020.

** Abstract:** Modern
meta-learning approaches for image classification rely on increasingly deep
networks to achieve state-of-the-art performance, making batch normalization
an essential component of meta-learning pipelines. However, the hierarchical
nature of the meta-learning setting presents several challenges that can
render conventional batch normalization ineffective, giving rise to the need
to rethink normalization in this setting. We evaluate a range of approaches
to batch normalization for meta-learning scenarios, and develop a novel
approach that we call TASKNORM. Experiments on fourteen datasets demonstrate
that the choice of batch normalization has a dramatic effect on both
classification accuracy and training time for both gradient based and
gradient free meta-learning approaches. Importantly, TASKNORM is found to
consistently improve performance. Finally, we provide a set of best practices
for normalization that will allow fair comparison of meta-learning
algorithms.

Andrew Y. K. Foong, Wessel P. Bruinsma, Jonathan Gordon, Yann Dubois, James
Requeima, and Richard E. Turner.
**Meta-learning
stationary stochastic process prediction with convolutional neural
processes**.
In *Advances in Neural Information Processing Systems 33*. Curran
Associates, Inc., 2020.

** Abstract:** Stationary stochastic
processes (SPs) are a key component of many probabilistic models, such as
those for off-the-grid spatio-temporal data. They enable the statistical
symmetry of underlying physical phenomena to be leveraged, thereby aiding
generalization. Prediction in such models can be viewed as a translation
equivariant map from observed data sets to predictive SPs, emphasizing the
intimate relationship between stationarity and equivariance. Building on
this, we propose the Convolutional Neural Process (ConvNP), which endows
Neural Processes (NPs) with translation equivariance and extends
convolutional conditional NPs to allow for dependencies in the predictive
distribution. The latter enables ConvNPs to be deployed in settings which
require coherent samples, such as Thompson sampling or conditional image
completion. Moreover, we propose a new maximum-likelihood objective to
replace the standard ELBO objective in NPs, which conceptually simplifies the
framework and empirically improves performance. We demonstrate the strong
performance and generalization capabilities of ConvNPs on 1D regression,
image completion, and various tasks with real-world spatio-temporal data.

Jonathan Gordon, John Bronskill, Matthias Bauer, Sebastian Nowozin, and
Richard Turner.
**Meta-learning
probabilistic inference for prediction**.
In *7th International Conference on Learning Representations*, New
Orleans, April 2019.

** Abstract:** This paper introduces a
new framework for data efficient and versatile learning. Specifically: 1) We
develop ML-PIP, a general framework for Meta-Learning approximate
Probabilistic Inference for Prediction. ML-PIP extends existing probabilistic
interpretations of meta-learning to cover a broad class of methods. 2) We
introduce Versa, an instance of the framework employing a flexible and
versatile amortization network that takes few-shot learning datasets as
inputs, with arbitrary numbers of shots, and outputs a distribution over
task-specific parameters in a single forward pass. Versa substitutes
optimization at test time with forward passes through inference networks,
amortizing the cost of inference and relieving the need for second
derivatives during training. 3) We evaluate Versa on benchmark datasets
where the method sets new state-of-the-art results, and can handle arbitrary
numbers of shots, and for classification, arbitrary numbers of classes at
train and test time. The power of the approach is then demonstrated through a
challenging few-shot ShapeNet view reconstruction task.

Robert Pinsler, Jonathan Gordon, Eric Nalisnick, and Jose Miguel
Hernández-Lobato.
**Bayesian
batch active learning as sparse subset approximation**.
In *Advances in Neural Information Processing Systems 32*, 2019.

** Abstract:** Leveraging the wealth of unlabeled data produced
in recent years provides great potential for improving supervised models.
When the cost of acquiring labels is high, probabilistic active learning
methods can be used to greedily select the most informative data points to be
labeled. However, for many large-scale problems standard greedy procedures
become computationally infeasible and suffer from negligible model change. In
this paper, we introduce a novel Bayesian batch active learning approach that
mitigates these issues. Our approach is motivated by approximating the
complete data posterior of the model parameters. While naive batch
construction methods result in correlated queries, our algorithm produces
diverse batches that enable efficient active learning at scale. We derive
interpretable closed-form solutions akin to existing active learning
procedures for linear models, and generalize to arbitrary models using random
projections. We demonstrate the benefits of our approach on several
large-scale regression and classification tasks.

James Requeima, Jonathan Gordon, John Bronskill, Sebastian Nowozin, and
Richard E. Turner.
**Fast
and flexible multi-task classification using conditional neural adaptive
processes**.
In *Advances in Neural Information Processing Systems 32*, 2019.

** Abstract:** The goal of this paper is to design image
classification systems that, after an initial multi-task training phase, can
automatically adapt to new tasks encountered at test time. We introduce a
conditional neural process based approach to the multi-task classification
setting for this purpose, and establish connections to the meta- and few-shot
learning literature. The resulting approach, called CNAPs, comprises a
classifier whose parameters are modulated by an adaptation network that takes
the current task's dataset as input. We demonstrate that CNAPs achieves
state-of-the-art results on the challenging Meta-Dataset benchmark indicating
high-quality transfer-learning. We show that the approach is robust, avoiding
both over-fitting in low-shot regimes and under-fitting in high-shot regimes.
Timing experiments reveal that CNAPs is computationally efficient at
test-time as it does not involve gradient based adaptation. Finally, we show
that trained models are immediately deployable to continual learning and
active learning where they can outperform existing approaches that do not
leverage transfer learning.

Jan-Peter Calliess, Nathan Korda, and Geoffrey J. Gordon.
**A distributed
mechanism for multi-agent convex optimisation and coordination with no-regret
learners**.
In *Workshop on Learning, Inference and Control of Multi-Agent Systems,
NIPS*, Barcelona, Spain, December 2016.

** Abstract:** We
develop an indirect mechanism for coordinated, distributed multi-agent
optimisation, and decision-making. Our approach extends previous work in
no-regret learning based mechanism design and renders it applicable to
partial information settings. We consider planning problems that can be
stated as a collection of single-agent convex programmes coupled by common
soft constraints. A key idea is to recast the joint optimisation problem as
distributed learning in a repeated game between the original agents and a
newly introduced group of adversarial agents who influence prices for
decisions and facilitate coordination. Under the weak behavioural assumption
that all agents employ selfish, sub-linear regret algorithms in the course of
the repeated game, we guarantee that our mechanism can achieve design goals
such as social optimality (efficiency) and Nash-equilibrium convergence to
within an error which approaches zero as the agents gain experience. Our
error bounds are deterministic or probabilistic, depending on the nature of
the regret bounds available for the algorithms employed by the agents. We
illustrate our method in an emissions market application.

Johannes Borgström, Andrew D. Gordon, Long Ouyang, Claudio Russo, Adam
Ścibior, and Marcin Szymczak.
**Fabular:
Regression formulas as probabilistic programming**.
In *Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on
Principles of Programming Languages*, POPL 2016, pages 271-283, New
York, NY, USA, 2016. ACM, doi
10.1145/2837614.2837653.

** Abstract:** Regression
formulas are a domain-specific language adopted by several R packages for
describing an important and useful class of statistical models: hierarchical
linear regressions. Formulas are succinct, expressive, and clearly popular,
so are they a useful addition to probabilistic programming languages? And
what do they mean? We propose a core calculus of hierarchical linear
regression, in which regression coefficients are themselves defined by nested
regressions (unlike in R). We explain how our calculus captures the essence
of the formula DSL found in R. We describe the design and implementation of
Fabular, a version of the Tabular schema-driven probabilistic programming
language, enriched with formulas based on our regression calculus. To the
best of our knowledge, this is the first formal description of the core ideas
of R's formula notation, the first development of a calculus of regression
formulas, and the first demonstration of the benefits of composing regression
formulas and latent variables in a probabilistic programming language.

Adam Ścibior, Zoubin Ghahramani, and Andrew D. Gordon.
**Practical
probabilistic programming with monads**.
In *Proceedings of the 8th ACM SIGPLAN Symposium on Haskell*.
Association for Computing Machinery, 2015, doi
10.1145/2804302.2804317.

** Abstract:** The machine
learning community has recently shown a lot of interest in practical
probabilistic programming systems that target the problem of Bayesian
inference. Such systems come in different forms, but they all express
probabilistic models as computational processes using syntax resembling
programming languages. In the functional programming community monads are
known to offer a convenient and elegant abstraction for programming with
probability distributions, but their use is often limited to very simple
inference problems. We show that it is possible to use the monad abstraction
to construct probabilistic models for machine learning, while still offering
good performance of inference in challenging models. We use a GADT as an
underlying representation of a probability distribution and apply Sequential
Monte Carlo-based methods to achieve efficient inference. We define a formal
semantics via measure theory. We demonstrate a clean and elegant
implementation that achieves performance comparable with Anglican, a
state-of-the-art probabilistic programming system.

Amar Shah, Andrew Gordon Wilson, and Zoubin Ghahramani.
**Student-t
processes as alternatives to Gaussian processes**.
In *AISTATS*, JMLR Proceedings. JMLR.org, 2014.

** Abstract:** We investigate the Student-t process as an alternative to the
Gaussian process as a nonparametric prior over functions. We derive closed
form expressions for the marginal likelihood and predictive distribution of a
Student-t process, by integrating away an inverse Wishart process prior over
the covariance kernel of a Gaussian process model. We show surprising
equivalences between different hierarchical Gaussian process models leading
to Student-t processes, and derive a new sampling scheme for the inverse
Wishart process, which helps elucidate these equivalences. Overall, we show
that a Student-t process can retain the attractive properties of a Gaussian
process - a nonparametric representation, analytic marginal and predictive
distributions, and easy model selection through covariance kernels - but has
enhanced flexibility, and predictive covariances that, unlike a Gaussian
process, explicitly depend on the values of training observations. We verify
empirically that a Student-t process is especially useful in situations where
there are changes in covariance structure, or in applications like Bayesian
optimization, where accurate predictive covariances are critical for good
performance. These advantages come at no additional computational cost over
Gaussian processes.

Andrew Gordon Wilson.
**Covariance
Kernels for Fast Automatic Pattern Discovery and Extrapolation with
Gaussian Processes**.
PhD thesis, University of Cambridge, Cambridge, UK, 2014.

** Abstract:** Truly intelligent systems are capable of pattern discovery and
extrapolation without human intervention. Bayesian nonparametric models,
which can uniquely represent expressive prior information and detailed
inductive biases, provide a distinct opportunity to develop intelligent
systems, with applications in essentially any learning and prediction
task.

Gaussian processes are rich distributions over functions, which
provide a Bayesian nonparametric approach to smoothing and interpolation. A
covariance kernel determines the support and inductive biases of a Gaussian
process. In this thesis, we introduce new covariance kernels to enable fast
automatic pattern discovery and extrapolation with Gaussian processes.

In
the introductory chapter, we discuss the high level principles behind all of
the models in this thesis: 1) we can typically improve the predictive
performance of a model by accounting for additional structure in data; 2) to
automatically discover rich structure in data, a model must have large
support and the appropriate inductive biases; 3) we most need expressive
models for large datasets, which typically provide more information for
learning structure, and 4) we can often exploit the existing inductive biases
(assumptions) or structure of a model for scalable inference, without the
need for simplifying assumptions.

In the context of this introduction, we
then discuss, in chapter 2, Gaussian processes as kernel machines, and my
views on the future of Gaussian process research.

In chapter 3 we
introduce the Gaussian process regression network (GPRN) framework, a
multi-output Gaussian process method which scales to many output variables,
and accounts for input-dependent correlations between the outputs. Underlying
the GPRN is a highly expressive kernel, formed using an adaptive mixture of
latent basis functions in a neural network like architecture. The GPRN is
capable of discovering expressive structure in data. We use the GPRN to model
the time-varying expression levels of 1000 genes, the spatially varying
concentrations of several distinct heavy metals, and multivariate volatility
(input dependent noise covariances) between returns on equity indices and
currency exchanges, which is particularly valuable for portfolio allocation.
We generalise the GPRN to an adaptive network framework, which does not
depend on Gaussian processes or Bayesian nonparametrics; and we outline
applications for the adaptive network in nuclear magnetic resonance (NMR)
spectroscopy, ensemble learning, and change-point modelling.

In chapter 4
we introduce simple closed form kernels for automatic pattern discovery and
extrapolation. These spectral mixture (SM) kernels are derived by modelling
the spectral density of a kernel (its Fourier transform) using a
scale-location Gaussian mixture. SM kernels form a basis for all stationary
covariances, and can be used as a drop-in replacement for standard kernels,
as they retain simple and exact learning and inference procedures. We use the
SM kernel to discover patterns and perform long range extrapolation on
atmospheric CO2 trends and airline passenger data, as well as on synthetic
examples. We also show that the SM kernel can be used to automatically
reconstruct several standard covariances. The SM kernel and the GPRN are
highly complementary; we show that using the SM kernel with adaptive basis
functions in a GPRN induces an expressive prior over non-stationary
kernels.

In chapter 5 we introduce GPatt, a method for fast
multidimensional pattern extrapolation, particularly suited to image and movie
data. Without human intervention - no hand crafting of kernel features, and
no sophisticated initialisation procedures - we show that GPatt can solve
large scale pattern extrapolation, inpainting and kernel discovery problems,
including a problem with 383,400 training points. GPatt exploits the
structure of a spectral mixture product (SMP) kernel, for fast yet exact
inference procedures. We find that GPatt significantly outperforms popular
alternative scalable Gaussian process methods in speed and accuracy.
Moreover, we discover profound differences between each of these methods,
suggesting expressive kernels, nonparametric representations, and scalable
inference which exploits existing model structure are useful in combination
for modelling large scale multidimensional patterns.

The models in this
dissertation have proven to be scalable and with greatly enhanced predictive
performance over the alternatives: the extra structure being modelled is an
important part of a wide variety of real data - including problems in
econometrics, gene expression, geostatistics, nuclear magnetic resonance
spectroscopy, ensemble learning, multi-output regression, change point
modelling, time series, multivariate volatility, image inpainting, texture
extrapolation, video extrapolation, acoustic modelling, and kernel
discovery.

Andrew Gordon Wilson, Yuting Wu, Daniel J. Holland, Sebastian Nowozin,
Mick D. Mantle, Lynn F. Gladden, and Andrew Blake.
**Bayesian inference for NMR
spectroscopy with applications to chemical quantification**.
*arXiv preprint arXiv:1402.3580*, 2014.

** Abstract:**
Nuclear magnetic resonance (NMR) spectroscopy exploits the magnetic
properties of atomic nuclei to discover the structure, reaction state and
chemical environment of molecules. We propose a probabilistic generative
model and inference procedures for NMR spectroscopy. Specifically, we use a
weighted sum of trigonometric functions undergoing exponential decay to model
free induction decay (FID) signals. We discuss the challenges in estimating
the components of this general model - amplitudes, phase shifts,
frequencies, decay rates, and noise variances - and offer practical
solutions. We compare with conventional Fourier transform spectroscopy for
estimating the relative concentrations of chemicals in a mixture, using
synthetic and experimentally acquired FID signals. We find the proposed model
is particularly robust to low signal to noise ratios (SNR), and overlapping
peaks in the Fourier transform of the FID, enabling accurate predictions
(e.g., 1% error at low SNR) which are not possible with conventional
spectroscopy (5% error).

Andrew Gordon Wilson and Ryan Prescott Adams.
**Gaussian
process kernels for pattern discovery and extrapolation**.
In *30th International Conference on Machine Learning*, February 18
2013.

** Abstract:** Gaussian processes are rich distributions
over functions, which provide a Bayesian nonparametric approach to smoothing
and interpolation. We introduce simple closed form kernels that can be used
with Gaussian processes to discover patterns and enable extrapolation. These
kernels are derived by modelling a spectral density - the Fourier transform
of a kernel - with a Gaussian mixture. The proposed kernels support a broad
class of stationary covariances, but Gaussian process inference remains
simple and analytic. We demonstrate the proposed kernels by discovering
patterns and performing long range extrapolation on synthetic examples, as
well as atmospheric CO2 trends and airline passenger data. We also show that
we can reconstruct standard covariances within our framework.

** Comment:** arXiv:1302.4245
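
As a rough illustration of the kernel described in this abstract, the one-dimensional spectral mixture kernel obtained from a Gaussian mixture spectral density can be sketched as below. This is a minimal sketch, not the authors' code: the function name and the parameter names `weights`, `means` and `variances` are assumptions chosen for readability, and `tau` denotes the lag between two inputs.

```python
import math


def spectral_mixture_kernel(tau, weights, means, variances):
    """Evaluate a 1-D spectral mixture kernel at lag tau = x - x'.

    Each mixture component contributes
        w * exp(-2 * pi^2 * tau^2 * v) * cos(2 * pi * tau * mu),
    where mu and v are the mean and variance of one Gaussian in the
    spectral density (illustrative parameter names, not from the paper's code).
    """
    return sum(
        w * math.exp(-2.0 * math.pi ** 2 * tau ** 2 * v)
        * math.cos(2.0 * math.pi * tau * mu)
        for w, mu, v in zip(weights, means, variances)
    )


# Two components: a smooth trend (zero mean frequency) plus a periodic part.
weights = [1.0, 0.5]
means = [0.0, 1.0]
variances = [0.01, 0.001]
k_zero = spectral_mixture_kernel(0.0, weights, means, variances)
k_half = spectral_mixture_kernel(0.5, weights, means, variances)
```

At zero lag the kernel equals the sum of the weights, and a single component with zero mean frequency reduces to a squared exponential kernel, which is one way to see how standard covariances can be recovered.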

Andrew Gordon Wilson, Elad Gilboa, Arye Nehorai, and John P Cunningham.
**GPatt: Fast multidimensional
pattern extrapolation with Gaussian processes**.
*arXiv preprint arXiv:1310.5288*, 2013.

** Abstract:**
Gaussian processes are typically used for smoothing and interpolation on
small datasets. We introduce a new Bayesian nonparametric framework - GPatt
- enabling automatic pattern extrapolation with Gaussian processes on large
multidimensional datasets. GPatt unifies and extends highly expressive
kernels and fast exact inference techniques. Without human intervention - no
hand crafting of kernel features, and no sophisticated initialisation
procedures - we show that GPatt can solve large scale pattern extrapolation,
inpainting, and kernel discovery problems, including a problem with 383,400
training points. We find that GPatt significantly outperforms popular
alternative scalable Gaussian process methods in speed and accuracy.
Moreover, we discover profound differences between each of these methods,
suggesting expressive kernels, nonparametric representations, and scalable
inference which exploits model structure are useful in combination for
modelling large scale multidimensional patterns.

Andrew Gordon Wilson, David A. Knowles, and Zoubin Ghahramani.
**Gaussian process regression
networks**.
In *29th International Conference on Machine Learning*, Edinburgh,
Scotland, June 2012.

** Abstract:** We introduce a new
regression framework, Gaussian process regression networks (GPRN), which
combines the structural properties of Bayesian neural networks with the
nonparametric flexibility of Gaussian processes. GPRN accommodates input
(predictor) dependent signal and noise correlations between multiple output
(response) variables, input dependent length-scales and amplitudes, and
heavy-tailed predictive distributions. We derive both elliptical slice
sampling and variational Bayes inference procedures for GPRN. We apply GPRN
as a multiple output regression and multivariate volatility model,
demonstrating substantially improved performance over eight popular multiple
output (multi-task) Gaussian process models and three multivariate volatility
models on real datasets, including a 1000 dimensional gene expression
dataset.

Andrew Gordon Wilson and Zoubin Ghahramani.
**Modelling input
varying correlations between multiple responses**.
In Peter A. Flach, Tijl De Bie, and Nello Cristianini, editors,
*ECML/PKDD*, volume 7524 of *Lecture Notes in Computer
Science*, pages 858-861. Springer, 2012.

** Abstract:**
We introduced a generalised Wishart process (GWP) for modelling input
dependent covariance matrices Σ(x), allowing one to model input varying
correlations and uncertainties between multiple response variables. The GWP
can naturally scale to thousands of response variables, as opposed to
competing multivariate volatility models which are typically intractable for
greater than 5 response variables. The GWP can also naturally capture a rich
class of covariance dynamics - periodicity, Brownian motion, smoothness,
etc. - through a covariance kernel.

Andrew Gordon Wilson, David A Knowles, and Zoubin Ghahramani.
**Gaussian process
regression networks**.
Technical Report arXiv:1110.4411 [stat.ML], Department of Engineering,
University of Cambridge, Cambridge, UK, October 19 2011.

**
Abstract:** We introduce a new regression framework, Gaussian process
regression networks (GPRN), which combines the structural properties of
Bayesian neural networks with the non-parametric flexibility of Gaussian
processes. This model accommodates input dependent signal and noise
correlations between multiple response variables, input dependent
length-scales and amplitudes, and heavy-tailed predictive distributions. We
derive both efficient Markov chain Monte Carlo and variational Bayes
inference procedures for this model. We apply GPRN as a multiple output
regression and multivariate volatility model, demonstrating substantially
improved performance over eight popular multiple output (multi-task) Gaussian
process models and three multivariate volatility models on benchmark
datasets, including a 1000 dimensional gene expression dataset.

** Comment:** arXiv:1110.4411

Andrew Gordon Wilson and Zoubin Ghahramani.
**Generalised
Wishart processes**.
In *27th Conference on Uncertainty in Artificial Intelligence*, 2011.

** Abstract:** We introduce a new stochastic process called the
generalised Wishart process (GWP). It is a collection of positive
semi-definite random matrices indexed by any arbitrary input variable. We use
this process as a prior over dynamic (e.g. time varying) covariance matrices.
The GWP captures a diverse class of covariance dynamics, naturally handles
missing data, scales nicely with dimension, has easily interpretable
parameters, and can use input variables that include covariates other than
time. We describe how to construct the GWP, introduce general procedures for
inference and prediction, and show that it outperforms its main competitor,
multivariate GARCH, even on financial data that especially suits GARCH.

** Comment:** Supplementary
Material, Best Student Paper Award

Andrew Gordon Wilson and Zoubin Ghahramani.
**Copula
processes**.
In *Advances in Neural Information Processing Systems 23*, 2010.
Spotlight.

** Abstract:** We define a copula process which
describes the dependencies between arbitrarily many random variables
independently of their marginal distributions. As an example, we develop a
stochastic volatility model, Gaussian Copula Process Volatility (GCPV), to
predict the latent standard deviations of a sequence of random variables. To
make predictions we use Bayesian inference, with the Laplace approximation,
and with Markov chain Monte Carlo as an alternative. We find our model can
outperform GARCH on simulated and financial data. And unlike GARCH, GCPV can
easily handle missing data, incorporate covariates other than time, and model
a rich class of covariance structures.

** Comment:** Supplementary
Material, slides.

T. Stepleton, Z. Ghahramani, G. Gordon, and T.-S. Lee.
**The block
diagonal infinite hidden Markov model**.
In D. van Dyk and M. Welling, editors, *12th International Conference on
Artificial Intelligence and Statistics*, volume 5, pages 552-559,
Clearwater Beach, FL, USA, April 2009. Microtome Publishing (paper) Journal
of Machine Learning Research.
ISSN 1938-7228.

** Abstract:** The Infinite Hidden Markov Model
(IHMM) extends hidden Markov models to have a countably infinite number of
hidden states (Beal et al., 2002; Teh et al.,
2006). We present a generalization of this framework that introduces nearly
block-diagonal structure in the transitions between the hidden states, where
blocks correspond to "sub-behaviors" exhibited by data sequences. In
identifying such structure, the model classifies, or partitions, sequence
data according to these sub-behaviors in an unsupervised way. We present an
application of this model to artificial data, a video gesture classification
task, and a musical theme labeling task, and show that components of the
model can also be applied to graph segmentation.

Benjamin Eysenbach, Shixiang Gu, Julian Ibarz, and Sergey Levine.
**Leave no trace: Learning to reset
for safe and autonomous reinforcement learning**.
In *6th International Conference on Learning Representations*, Vancouver,
Canada, April 2018.

** Abstract:** Deep reinforcement learning
algorithms can learn complex behavioral skills, but real-world application of
these methods requires a large amount of experience to be collected by the
agent. In practical settings, such as robotics, this involves repeatedly
attempting a task, resetting the environment between each attempt. However,
not all tasks are easily or automatically reversible. In practice, this
learning process requires extensive human intervention. In this work, we
propose an autonomous method for safe and efficient reinforcement learning
that simultaneously learns a forward and reset policy, with the reset policy
resetting the environment for a subsequent attempt. By learning a value
function for the reset policy, we can automatically determine when the
forward policy is about to enter a non-reversible state, providing for
uncertainty-aware safety aborts. Our experiments illustrate that proper use
of the reset policy can greatly reduce the number of manual resets required
to learn a task, can reduce the number of unsafe actions that lead to
non-reversible states, and can automatically induce a curriculum.

** Comment:** [Video]

Vitchyr Pong, Shixiang Gu, Murtaza Dalal, and Sergey Levine.
**Temporal
difference models: Model-free deep RL for model-based control**.
In *6th International Conference on Learning Representations*, Vancouver,
Canada, April 2018.

** Abstract:** Model-free reinforcement
learning (RL) has been proven to be a powerful, general tool for learning
complex behaviors. However, its sample efficiency is often impractically
large for solving challenging real-world problems, even for off-policy
algorithms such as Q-learning. A limiting factor in classic model-free RL is
that the learning signal consists only of scalar rewards, ignoring much of
the rich information contained in state transition tuples. Model-based RL
uses this information, by training a predictive model, but often does not
achieve the same asymptotic performance as model-free RL due to model bias.
We introduce temporal difference models (TDMs), a family of goal-conditioned
value functions that can be trained with model-free learning and used for
model-based control. TDMs combine the benefits of model-free and model-based
RL: they leverage the rich information in state transitions to learn very
efficiently, while still attaining asymptotic performance that exceeds that
of direct model-based RL methods. Our experimental results show that, on a
range of continuous control tasks, TDMs provide a substantial improvement in
efficiency compared to state-of-the-art model-based and model-free
methods.

Shixiang Gu, Timothy Lillicrap, Zoubin Ghahramani, Richard E. Turner, Bernhard
Schölkopf, and Sergey Levine.
**Interpolated policy gradient:
Merging on-policy and off-policy policy gradient estimation for deep
reinforcement learning**.
In *Advances in Neural Information Processing Systems 31*, Long Beach,
USA, December 2017.

** Abstract:** Off-policy model-free deep
reinforcement learning methods using previously collected data can improve
sample efficiency over on-policy policy gradient techniques. On the other
hand, on-policy algorithms are often more stable and easier to use. This
paper examines, both theoretically and empirically, approaches to merging on-
and off-policy updates for deep reinforcement learning. Theoretical results
show that off-policy updates with a value function estimator can be
interpolated with on-policy policy gradient updates whilst still satisfying
performance bounds. Our analysis uses control variate methods to produce a
family of policy gradient algorithms, with several recently proposed
algorithms being special cases of this family. We then provide an empirical
comparison of these techniques with the remaining algorithmic details fixed,
and show how different mixing of off-policy gradient estimates with on-policy
samples contribute to improvements in empirical performance. The final
algorithm provides a generalization and unification of existing deep policy
gradient techniques, has theoretical guarantees on the bias introduced by
off-policy updates, and improves on the state-of-the-art model-free deep RL
methods on a number of OpenAI Gym continuous control benchmarks.

Natasha Jaques, Shixiang Gu, Dzmitry Bahdanau, José Miguel Hernández-Lobato,
Richard E. Turner, and Douglas Eck.
**Sequence tutor: Conservative
fine-tuning of sequence generation models with KL-control**.
In *34th International Conference on Machine Learning*, Sydney,
Australia, August 2017.

** Abstract:** This paper proposes a
general method for improving the structure and quality of sequences generated
by a recurrent neural network (RNN), while maintaining information originally
learned from data, as well as sample diversity. An RNN is first pre-trained
on data using maximum likelihood estimation (MLE), and the probability
distribution over the next token in the sequence learned by this model is
treated as a prior policy. Another RNN is then trained using reinforcement
learning (RL) to generate higher-quality outputs that account for
domain-specific incentives while retaining proximity to the prior policy of
the MLE RNN. To formalize this objective, we derive novel off-policy RL
methods for RNNs from KL-control. The effectiveness of the approach is
demonstrated on two applications: 1) generating novel musical melodies, and
2) computational molecular generation. For both problems, we show that the
proposed method improves the desired properties and structure of the
generated sequences, while maintaining information learned from data.

** Comment:** [MIT
Technology Review] [Video]

Shixiang Gu, Ethan Holly, Timothy Lillicrap, and Sergey Levine.
**Deep reinforcement learning for
robotic manipulation with asynchronous off-policy updates**.
In *IEEE International Conference on Robotics and Automation*,
Singapore, May 2017.

** Abstract:** Reinforcement learning
holds the promise of enabling autonomous robots to learn large repertoires of
behavioral skills with minimal human intervention. However, robotic
applications of reinforcement learning often compromise the autonomy of the
learning process in favor of achieving training times that are practical for
real physical systems. This typically involves introducing hand-engineered
policy representations and human-supplied demonstrations. Deep reinforcement
learning alleviates this limitation by training general-purpose neural
network policies, but applications of direct deep reinforcement learning
algorithms have so far been restricted to simulated settings and relatively
simple tasks, due to their apparent high sample complexity. In this paper, we
demonstrate that a recent deep reinforcement learning algorithm based on
off-policy training of deep Q-functions can scale to complex 3D manipulation
tasks and can learn deep neural network policies efficiently enough to train
on real physical robots. We demonstrate that the training times can be
further reduced by parallelizing the algorithm across multiple robots which
pool their policy updates asynchronously. Our experimental evaluation shows
that our method can learn a variety of 3D manipulation skills in simulation
and a complex door opening skill on real robots without any prior
demonstrations or manually designed representations.

** Comment:** [Google
Blogpost] [MIT
Technology Review] [Video]

Shixiang Gu, Timothy Lillicrap, Zoubin Ghahramani, Richard E. Turner, and
Sergey Levine.
**Q-prop: Sample-efficient policy
gradient with an off-policy critic**.
In *5th International Conference on Learning Representations*, Toulon,
France, April 2017.

** Abstract:** Model-free deep
reinforcement learning (RL) methods have been successful in a wide variety of
simulated domains. However, a major obstacle facing deep RL in the real world
is their high sample complexity. Batch policy gradient methods offer stable
learning, but at the cost of high variance, which often requires large
batches. TD-style methods, such as off-policy actor-critic and Q-learning,
are more sample-efficient but biased, and often require costly hyperparameter
sweeps to stabilize. In this work, we aim to develop methods that combine the
stability of policy gradients with the efficiency of off-policy RL. We
present Q-Prop, a policy gradient method that uses a Taylor expansion of the
off-policy critic as a control variate. Q-Prop is both sample efficient and
stable, and effectively combines the benefits of on-policy and off-policy
methods. We analyze the connection between Q-Prop and existing model-free
algorithms, and use control variate theory to derive two variants of Q-Prop
with conservative and aggressive adaptation. We show that conservative Q-Prop
provides substantial gains in sample efficiency over trust region policy
optimization (TRPO) with generalized advantage estimation (GAE), and improves
stability over deep deterministic policy gradient (DDPG), the
state-of-the-art on-policy and off-policy methods, on OpenAI Gym's MuJoCo
continuous control environments.

Eric Jang, Shixiang Gu, and Ben Poole.
**Categorical reparametrization
with Gumbel-Softmax**.
In *5th International Conference on Learning Representations*, Toulon,
France, April 2017.

** Abstract:** Categorical variables are a
natural choice for representing discrete structure in the world. However,
stochastic neural networks rarely use categorical latent variables due to the
inability to backpropagate through samples. In this work, we present an
efficient gradient estimator that replaces the non-differentiable sample from
a categorical distribution with a differentiable sample from a novel
Gumbel-Softmax distribution. This distribution has the essential property
that it can be smoothly annealed into a categorical distribution. We show
that our Gumbel-Softmax estimator outperforms state-of-the-art gradient
estimators on structured output prediction and unsupervised generative
modeling tasks with categorical latent variables, and enables large speedups
on semi-supervised classification.
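
The sampling step this abstract refers to can be sketched in a few lines. The snippet below is a minimal illustration using only the Python standard library, under the usual formulation in which Gumbel noise is added to the logits and a temperature-scaled softmax is applied; the function name and arguments are assumptions for this sketch. Plain Python floats carry no gradients, so a practical estimator would express the same arithmetic in an automatic differentiation framework.

```python
import math
import random


def gumbel_softmax_sample(logits, temperature, rng=random):
    """Draw one relaxed categorical sample (a point on the simplex)."""
    perturbed = []
    for logit in logits:
        # Clamp the uniform draw away from 0 and 1 so both logs are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel_noise = -math.log(-math.log(u))
        perturbed.append((logit + gumbel_noise) / temperature)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(perturbed)
    exps = [math.exp(p - peak) for p in perturbed]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
class_probs = [0.7, 0.2, 0.1]
logits = [math.log(p) for p in class_probs]
sample = gumbel_softmax_sample(logits, temperature=0.5, rng=rng)
```

Lower temperatures push samples towards one-hot vectors, while the index of the largest entry is distributed according to the softmax of the logits at any temperature.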

Shixiang Gu, Timothy Lillicrap, Ilya Sutskever, and Sergey Levine.
**Continuous deep Q-learning with
model-based acceleration**.
In *33rd International Conference on Machine Learning*, New York, USA,
June 2016.

** Abstract:** Model-free reinforcement learning
has been successfully applied to a range of challenging problems, and has
recently been extended to handle large neural network policies and value
functions. However, the sample complexity of model-free algorithms,
particularly when using high-dimensional function approximators, tends to
limit their applicability to physical systems. In this paper, we explore
algorithms and representations to reduce the sample complexity of deep
reinforcement learning for continuous control tasks. We propose two
complementary techniques for improving the efficiency of such algorithms.
First, we derive a continuous variant of the Q-learning algorithm, which we
call normalized advantage functions (NAF), as an alternative to the more
commonly used policy gradient and actor-critic methods. NAF representation
allows us to apply Q-learning with experience replay to continuous tasks, and
substantially improves performance on a set of simulated robotic control
tasks. To further improve the efficiency of our approach, we explore the use
of learned models for accelerating model-free reinforcement learning. We show
that iteratively refitted local linear models are especially effective for
this, and demonstrate substantially faster learning on domains where such
models are applicable.
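
To make the NAF idea concrete, the sketch below evaluates a Q-value whose advantage term is a negative quadratic in the action, so that the maximising action is available in closed form. It is an illustration under stated assumptions rather than the paper's implementation: in practice the quantities `mu`, `L` and `value` would be outputs of a neural network conditioned on the state, whereas here they are fixed hypothetical numbers, and the function name is invented for this example.

```python
def naf_q_value(action, mu, L, value):
    """Q(x, u) = V(x) - 0.5 * (u - mu)^T (L L^T) (u - mu).

    `L` is a lower-triangular matrix (list of rows) with positive diagonal,
    so L L^T is positive definite and Q is maximised at u = mu.
    """
    diff = [a - m for a, m in zip(action, mu)]
    n = len(diff)
    # z = L^T (u - mu); the quadratic form is then the squared norm of z.
    z = [sum(L[i][j] * diff[i] for i in range(n)) for j in range(n)]
    advantage = -0.5 * sum(zj * zj for zj in z)
    return value + advantage


mu = [0.5, -1.0]                 # hypothetical greedy action for some state
L = [[1.0, 0.0], [0.5, 2.0]]     # hypothetical lower-triangular factor
value = 3.0                      # hypothetical state value V(x)

q_greedy = naf_q_value(mu, mu, L, value)
q_other = naf_q_value([1.5, 0.0], mu, L, value)
```

Because the advantage is never positive and vanishes at `mu`, the greedy action needed for Q-learning in continuous action spaces is simply `mu`, with no inner optimisation loop.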

Shixiang Gu, Sergey Levine, Ilya Sutskever, and Andriy Mnih.
**MuProp: Unbiased backpropagation
for stochastic neural networks**.
In *4th International Conference on Learning Representations*, San Juan,
Puerto Rico, May 2016.

** Abstract:** Deep neural networks are
powerful parametric models that can be trained efficiently using the
backpropagation algorithm. Stochastic neural networks combine the power of
large parametric functions with that of graphical models, which makes it
possible to learn very complex distributions. However, as backpropagation is
not directly applicable to stochastic networks that include discrete sampling
operations within their computational graph, training such networks remains
difficult. We present MuProp, an unbiased gradient estimator for stochastic
networks, designed to make this task easier. MuProp improves on the
likelihood-ratio estimator by reducing its variance using a control variate
based on the first-order Taylor expansion of a mean-field network. Crucially,
unlike prior attempts at using backpropagation for training stochastic
networks, the resulting estimator is unbiased and well behaved. Our
experiments on structured output prediction and discrete latent variable
modeling demonstrate that MuProp yields consistently good performance across
a range of difficult tasks.

Shixiang Gu, Zoubin Ghahramani, and Richard E. Turner.
**Neural adaptive sequential Monte
Carlo**.
In *Advances in Neural Information Processing Systems 29*, Montréal,
Canada, December 2015.

** Abstract:** Sequential Monte Carlo (SMC),
or particle filtering, is a popular class of methods for sampling from an
intractable target distribution using a sequence of simpler intermediate
distributions. Like other importance sampling-based methods, performance is
critically dependent on the proposal distribution: a bad proposal can lead to
arbitrarily inaccurate estimates of the target distribution. This paper
presents a new method for automatically adapting the proposal using an
approximation of the Kullback-Leibler divergence between the true posterior
and the proposal distribution. The method is very flexible, applicable to any
parameterised proposal distribution and it supports online and batch
variants. We use the new framework to adapt powerful proposal distributions
with rich parameterisations based upon neural networks leading to Neural
Adaptive Sequential Monte Carlo (NASMC). Experiments indicate that NASMC
significantly improves inference in a non-linear state space model
outperforming adaptive proposal methods including the Extended Kalman and
Unscented Particle Filters. Experiments also indicate that improved inference
translates into improved parameter learning when NASMC is used as a
subroutine of Particle Marginal Metropolis Hastings. Finally we show that
NASMC is able to train a neural network-based deep recurrent generative model
achieving results that compete with the state-of-the-art for polyphonic
music modelling. NASMC can be seen as bridging the gap between adaptive SMC
methods and the recent work in scalable, black-box variational inference.

Nilesh Tripuraneni, Shixiang Gu, Hong Ge, and Zoubin Ghahramani.
**Particle
Gibbs for infinite hidden Markov models**.
In *Advances in Neural Information Processing Systems 29*, Montreal,
Canada, May 2015.

** Abstract:** Infinite Hidden Markov Models
(iHMMs) are an attractive, nonparametric generalization of the classical
Hidden Markov Model which can automatically infer the number of hidden states
in the system. However, due to the infinite-dimensional nature of the
transition dynamics, performing inference in the iHMM is difficult. In this
paper, we present an infinite-state Particle Gibbs (PG) algorithm to
resample state trajectories for the iHMM. The proposed algorithm uses an
efficient proposal optimized for iHMMs and leverages ancestor sampling to
improve the mixing of the standard PG algorithm. Our algorithm demonstrates
significant convergence improvements on synthetic and real world data
sets.

Creighton Heaukulani, David A. Knowles, and Zoubin Ghahramani.
**Beta diffusion trees and
hierarchical feature allocations**.
Technical report, Dept. of Engineering, University of Cambridge, August 2014.

** Abstract:** We define the beta diffusion tree, a random tree
structure with a set of leaves that defines a collection of overlapping
subsets of objects, known as a feature allocation. A generative process for
the tree structure is defined in terms of particles (representing the
objects) diffusing in some continuous space, analogously to the Dirichlet
diffusion tree (Neal, 2003b), which defines a tree structure over partitions
(i.e., non-overlapping subsets) of the objects. Unlike in the Dirichlet
diffusion tree, multiple copies of a particle may exist and diffuse along
multiple branches in the beta diffusion tree, and an object may therefore
belong to multiple subsets of particles. We demonstrate how to build a
hierarchically-clustered factor analysis model with the beta diffusion tree
and how to perform inference over the random tree structures with a Markov
chain Monte Carlo algorithm. We conclude with several numerical experiments
on missing data problems with data sets of gene expression microarrays,
international development statistics, and intranational socioeconomic
measurements.

Creighton Heaukulani, David A. Knowles, and Zoubin Ghahramani.
**Beta
diffusion trees**.
In *31st International Conference on Machine Learning*, Beijing, China,
June 2014.

** Abstract:** We define the beta diffusion tree, a
random tree structure with a set of leaves that defines a collection of
overlapping subsets of objects, known as a feature allocation. The generative
process for the tree is defined in terms of particles (representing the
objects) diffusing in some continuous space, analogously to the Dirichlet and
Pitman–Yor diffusion trees (Neal, 2003b; Knowles & Ghahramani, 2011), both
of which define tree structures over clusters of the particles. With the beta
diffusion tree, however, multiple copies of a particle may exist and diffuse
to multiple locations in the continuous space, resulting in (a random number
of) possibly overlapping clusters of the objects. We demonstrate how to build
a hierarchically-clustered factor analysis model with the beta diffusion tree
and how to perform inference over the random tree structures with a Markov
chain Monte Carlo algorithm. We conclude with several numerical experiments
on missing data problems with data sets of gene expression arrays,
international development statistics, and intranational socioeconomic
measurements.

Creighton Heaukulani and Daniel M. Roy.
**The combinatorial structure of beta
negative binomial processes**.
Technical report, Dept. of Engineering, University of Cambridge, March 2014.

** Abstract:** We characterize the combinatorial structure of
conditionally-i.i.d. sequences of negative binomial processes with a common
beta process base measure. In Bayesian nonparametric applications, such
processes have served as models for unknown multisets of a measurable space.
Previous work has characterized random subsets arising from
conditionally-i.i.d. sequences of Bernoulli processes with a common beta
process base measure. In this case, the combinatorial structure is described
by the Indian buffet process. Our results give a count analogue of the Indian
buffet process, which we call a negative binomial Indian buffet process. As
an intermediate step toward this goal, we provide constructions for the beta
negative binomial process that avoid a representation of the underlying beta
process base measure.

Creighton Heaukulani and Zoubin Ghahramani.
**Dynamic
probabilistic models for latent feature propagation in social
networks**.
In *30th International Conference on Machine Learning*, Atlanta,
Georgia, USA, June 2013.

** Abstract:** Current Bayesian
models for dynamic social network data have focused on modelling the
influence of evolving unobserved structure on observed social interactions.
However, an understanding of how observed social relationships from the past
affect future unobserved structure in the network has been neglected. In this
paper, we introduce a new probabilistic model for capturing this phenomenon,
which we call latent feature propagation, in social networks. We demonstrate
our model's capability for inferring such latent structure in varying types
of social network datasets, and experimental studies show this structure
achieves higher predictive performance on link prediction and forecasting
tasks.

Shakir Mohamed, Katherine A. Heller, and Zoubin Ghahramani.
**Bayesian and
L1 approaches for sparse unsupervised learning**.
In

** Abstract:** The use of L1 regularisation for sparse learning
has generated immense research interest, with many successful applications in
diverse areas such as signal acquisition, image coding, genomics and
collaborative filtering. While existing work highlights the many advantages
of L1 methods, in this paper we find that L1 regularisation often
dramatically under-performs in terms of predictive performance when compared
to other methods for inferring sparsity. We focus on unsupervised latent
variable models, and develop L1 minimising factor models, Bayesian variants
of "L1", and Bayesian models with a stronger L0-like sparsity induced through
spike-and-slab distributions. These spike-and-slab Bayesian factor models
encourage sparsity while accounting for uncertainty in a principled manner,
and avoid unnecessary shrinkage of non-zero values. We demonstrate on a
number of data sets that in practice spike-and-slab Bayesian methods
out-perform L1 minimisation, even on a computational budget. We thus
highlight the need to re-assess the wide use of L1 methods in
sparsity-reliant applications, particularly when we care about generalising
to previously unseen data, and provide an alternative that, over many varying
conditions, provides improved generalisation performance.

Joshua Abbott, Katherine A. Heller, Zoubin Ghahramani, and Thomas L. Griffiths.
**Testing a
Bayesian measure of representativeness using a large image
database**.
In *Advances in Neural Information Processing Systems 24*, Cambridge,
MA, USA, 2011. The MIT Press.

** Abstract:** How do people
determine which elements of a set are most representative of that set? We
extend an existing Bayesian measure of representativeness, which indicates
the representativeness of a sample from a distribution, to define a measure
of the representativeness of an item to a set. We show that this measure is
formally related to a machine learning method known as Bayesian Sets.
Building on this connection, we derive an analytic expression for the
representativeness of objects described by a sparse vector of binary
features. We then apply this measure to a large database of images, using it
to determine which images are the most representative members of different
sets. Comparing the resulting predictions to human judgments of
representativeness provides a test of this measure with naturalistic stimuli,
and illustrates how databases that are more commonly used in computer vision
and machine learning can be used to evaluate psychological theories.

Sinead Williamson, Katherine A. Heller, C. Wang, and D. M. Blei.
**The IBP
compound Dirichlet process and its application to focused topic
modeling**.
In *27th International Conference on Machine Learning*, pages
1151-1158, Haifa, Israel, June 2010.

** Abstract:** The
hierarchical Dirichlet process (HDP) is a Bayesian nonparametric mixed
membership model - each data point is modeled with a collection of
components of different proportions. Though powerful, the HDP makes an
assumption that the probability of a component being exhibited by a data
point is positively correlated with its proportion within that data point.
This might be an undesirable assumption. For example, in topic modeling, a
topic (component) might be rare throughout the corpus but dominant within
those documents (data points) where it occurs. We develop the IBP compound
Dirichlet process (ICD), a Bayesian nonparametric prior that decouples
across-data prevalence and within-data proportion in a mixed membership
model. The ICD combines properties from the HDP and the Indian buffet process
(IBP), a Bayesian nonparametric prior on binary matrices. The ICD assigns a
subset of the shared mixture components to each data point. This subset, the
data point's "focus", is determined independently from the amount that each
of its components contributes. We develop an ICD mixture model for text, the
focused topic model (FTM), and show superior performance over the HDP-based
topic model.

R. Silva, K. A. Heller, Z. Ghahramani, and E. M. Airoldi.
**Ranking
relations using analogies in biological and information networks**.
*Annals of Applied Statistics*, 4(2):615-644, 2010.

** Abstract:** Analogical reasoning depends fundamentally on the ability to
learn and generalize about relations between objects. We develop an approach
to relational learning which, given a set of pairs of objects S = A(1):B(1),
A(2):B(2), ..., A(N):B(N), measures how well other pairs A:B fit in with the
set S. Our work addresses the question: is the relation between objects A and
B analogous to those relations found in S? Such questions are particularly
relevant in information retrieval, where an investigator might want to search
for analogous pairs of objects that match the query set of interest. There
are many ways in which objects can be related, making the task of measuring
analogies very challenging. Our approach combines a similarity measure on
function spaces with Bayesian analysis to produce a ranking. It requires data
containing features of the objects of interest and a link matrix specifying
which relationships exist; no further attributes of such relationships are
necessary. We illustrate the potential of our method on text analysis and
information networks. An application on discovering functional interactions
between pairs of proteins is discussed in detail, where we show that our
approach can work in practice even if a small set of protein pairs is
provided.

Shakir Mohamed, Katherine A. Heller, and Zoubin Ghahramani.
**Bayesian
exponential family PCA**.
In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, *Advances in
Neural Information Processing Systems 21*, pages 1089-1096, Cambridge,
MA, USA, December 2009. The MIT Press.

** Abstract:**
Principal Components Analysis (PCA) has become established as one of the key
tools for dimensionality reduction when dealing with real valued data.
Approaches such as exponential family PCA and non-negative matrix
factorisation have successfully extended PCA to non-Gaussian data types, but
these techniques fail to take advantage of Bayesian inference and can suffer
from problems of overfitting and poor generalisation. This paper presents a
fully probabilistic approach to PCA, which is generalised to the exponential
family, based on Hybrid Monte Carlo sampling. We describe the model which is
based on a factorisation of the observed data matrix, and show performance of
the model on both synthetic and real data.

** Comment:** spotlight.

R. Savage, K. A. Heller, Y. Xu, Zoubin Ghahramani, W. Truman, M. Grant,
K. Denby, and D. L. Wild.
**R/BHC: fast
Bayesian hierarchical clustering for microarray data**.
*BMC Bioinformatics*, 10(242):1-9, August 2009. doi:
10.1186/1471-2105-10-242.

** Abstract:** Background:
Although the use of clustering methods has rapidly become one of the standard
computational approaches in the literature of microarray gene expression data
analysis, little attention has been paid to uncertainty in the results
obtained.

Results: We present an R/Bioconductor port of a fast novel
algorithm for Bayesian agglomerative hierarchical clustering and demonstrate
its use in clustering gene expression microarray data. The method performs
bottom-up hierarchical clustering, using a Dirichlet Process (infinite
mixture) to model uncertainty in the data and Bayesian model selection to
decide at each step which clusters to merge.

Conclusion: Biologically
plausible results are presented from a well studied data set: expression
profiles of *A. thaliana* subjected to a variety of biotic and abiotic
stresses. Our method avoids several limitations of traditional methods, for
example how many clusters there should be and how to choose a principled
distance metric.

Yang Xu, Katherine A. Heller, and Zoubin Ghahramani.
**Tree-based
inference for Dirichlet process mixtures**.
In D. van Dyk and M. Welling, editors, *12th International Conference on
Artificial Intelligence and Statistics*, volume 5, pages 623-630,
Clearwater Beach, FL, USA, April 2009. Microtome Publishing (paper), Journal
of Machine Learning Research (online).
ISSN 1938-7228.

** Abstract:** The Dirichlet process mixture
(DPM) is a widely used model for clustering and for general nonparametric
Bayesian density estimation. Unfortunately, like in many statistical models,
exact inference in a DPM is intractable, and approximate methods are needed
to perform efficient inference. While most attention in the literature has
been placed on Markov chain Monte Carlo (MCMC) [1, 2, 3], variational
Bayesian (VB) [4] and collapsed variational methods [5], [6] recently
introduced a novel class of approximation for DPMs based on Bayesian
hierarchical clustering (BHC). These tree-based combinatorial approximations
efficiently sum over exponentially many ways of partitioning the data and
offer a novel lower bound on the marginal likelihood of the DPM [6]. In this
paper we make the following contributions: (1) We show empirically that the
BHC lower bounds are substantially tighter than the bounds given by VB [4]
and by collapsed variational methods [5] on synthetic and real datasets. (2)
We also show that BHC offers a more accurate predictive performance on these
datasets. (3) We further improve the tree-based lower bounds with an
algorithm that efficiently sums contributions from alternative trees. (4) We
present a fast approximate method for BHC. Our results suggest that our
combinatorial approximate inference methods and lower bounds may be useful
not only in DPMs but in other models as well.

Katherine A. Heller, Sinead Williamson, and Zoubin Ghahramani.
**Statistical
models for partial membership**.
In Andrew McCallum and Sam Roweis, editors, *25th International Conference
on Machine Learning*, pages 392-399, Helsinki, Finland, July 2008.
Omnipress.

** Abstract:** We present a principled Bayesian
framework for modeling partial memberships of data points to clusters. Unlike
a standard mixture model which assumes that each data point belongs to one
and only one mixture component, or cluster, a partial membership model allows
data points to have fractional membership in multiple clusters. Algorithms
which assign data points partial memberships to clusters can be useful for
tasks such as clustering genes based on microarray data (Gasch & Eisen,
2002). Our Bayesian Partial Membership Model (BPM) uses exponential family
distributions to model each cluster, and a product of these distributions,
with weighted parameters, to model each datapoint. Here the weights
correspond to the degree to which the datapoint belongs to each cluster. All
parameters in the BPM are continuous, so we can use Hybrid Monte Carlo to
perform inference and learning. We discuss relationships between the BPM and
Latent Dirichlet Allocation, Mixed Membership models, Exponential Family PCA,
and fuzzy clustering. Lastly, we show some experimental results and discuss
nonparametric extensions to our model.

Katherine A. Heller and Zoubin Ghahramani.
**A nonparametric
Bayesian approach to modeling overlapping clusters**.
In Marina Meila and Xiaotong Shen, editors, *AISTATS*, volume 2 of
*JMLR Proceedings*, pages 187-194. JMLR.org, 2007.

** Abstract:** Although clustering data into mutually exclusive partitions has
been an extremely successful approach to unsupervised learning, there are
many situations in which a richer model is needed to fully represent the
data. This is the case in problems where data points actually simultaneously
belong to multiple, overlapping clusters. For example a particular gene may
have several functions, therefore belonging to several distinct clusters of
genes, and a biologist may want to discover these through unsupervised
modeling of gene expression data. We present a new nonparametric Bayesian
method, the Infinite Overlapping Mixture Model (IOMM), for modeling
overlapping clusters. The IOMM uses exponential family distributions to model
each cluster and forms an overlapping mixture by taking products of such
distributions, much like products of experts (Hinton, 2002). The IOMM allows
an unbounded number of clusters, and assignments of points to (multiple)
clusters are modeled using an Indian Buffet Process (IBP; Griffiths and
Ghahramani, 2006). The IOMM has the desirable properties of being able to
focus in on overlapping regions while maintaining the ability to model a
potentially infinite number of clusters which may overlap. We derive MCMC
inference algorithms for the IOMM and show that these can be used to cluster
movies into multiple genres.

Ricardo Silva, Katherine A. Heller, and Zoubin Ghahramani.
**Analogical
reasoning with relational Bayesian sets**.
In Marina Meila and Xiaotong Shen, editors, *AISTATS*, volume 2 of
*JMLR Proceedings*, pages 500-507. JMLR.org, 2007.

** Abstract:** Analogical reasoning depends fundamentally on the ability to
learn and generalize about relations between objects. There are many ways in
which objects can be related, making automated analogical reasoning very
challenging. Here we develop an approach which, given a set of pairs of
related objects S = A1:B1, A2:B2, ..., AN:BN, measures how well other pairs
A:B fit in with the set S. This addresses the question: is the relation
between objects A and B analogous to those relations found in S? We recast
this classical problem as a problem of Bayesian analysis of relational
data. This problem is non-trivial because direct similarity between
objects is not a good way of measuring analogies. For instance, the analogy
between an electron around the nucleus of an atom and a planet around the Sun
is hardly justified by isolated, non-relational, comparisons of an electron
to a planet, and a nucleus to the Sun. We develop a generative model for
predicting the existence of relationships and extend the framework of
Ghahramani and Heller (2005) to provide a Bayesian measure for how analogous
a relation is to other relations. This sheds new light on an old problem,
which we motivate and illustrate through practical applications in
exploratory data analysis.

Zoubin Ghahramani and Katherine A. Heller.
**Bayesian
sets**.
In Y. Weiss, B. Schölkopf, and J. Platt, editors, *Advances in Neural
Information Processing Systems 18*, pages 435-442, Cambridge, MA, USA,
December 2006. The MIT Press.

** Abstract:** Inspired by
"Google™ Sets", we consider the problem of retrieving items from a
concept or cluster, given a query consisting of a few items from that
cluster. We formulate this as a Bayesian inference problem and describe a
very simple algorithm for solving it. Our algorithm uses a model-based
concept of a cluster and ranks items using a score which evaluates the
marginal probability that each item belongs to a cluster containing the query
items. For exponential family models with conjugate priors this marginal
probability is a simple function of sufficient statistics. We focus on sparse
binary data and show that our score can be evaluated exactly using a single
sparse matrix multiplication, making it possible to apply our algorithm to
very large datasets. We evaluate our algorithm on three datasets: retrieving
movies from EachMovie, finding completions of author sets from the NIPS
dataset, and finding completions of sets of words appearing in the Grolier
encyclopedia. We compare to Google™ Sets and show that Bayesian Sets
gives very reasonable set completions.
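
The sparse-binary case described in this abstract can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes independent Bernoulli features with Beta priors, and the function name `bayesian_sets_scores`, the prior scale `c`, and the choice of setting the prior from empirical feature means are all assumptions made here for the example. Under those assumptions the log score of every item is a constant plus one matrix-vector product.

```python
import numpy as np


def bayesian_sets_scores(X, query_idx, c=2.0):
    """Log scores of all items (rows of binary X) against a query set.

    Assumes a Beta-Bernoulli model per feature; `c` and the
    mean-based prior below are illustrative choices, not fixed by the text.
    """
    X = np.asarray(X, dtype=float)
    n_query = len(query_idx)

    # Hypothetical prior: Beta(alpha_j, beta_j) scaled from feature means,
    # clipped so that no hyperparameter is exactly zero.
    mean = np.clip(X.mean(axis=0), 1e-6, 1.0 - 1e-6)
    alpha = c * mean
    beta = c * (1.0 - mean)

    # Sufficient statistics of the query set: per-feature counts of ones.
    s = X[query_idx].sum(axis=0)
    alpha_post = alpha + s
    beta_post = beta + n_query - s

    # log score(x) = const + sum_j q_j * x_j
    const = np.sum(
        np.log(alpha + beta) - np.log(alpha + beta + n_query)
        + np.log(beta_post) - np.log(beta)
    )
    q = np.log(alpha_post) - np.log(alpha) - np.log(beta_post) + np.log(beta)

    # One matrix-vector product scores every item; with a sparse X this
    # is the single sparse multiplication mentioned in the abstract.
    return const + X @ q


# Toy data: items 0 and 1 share features, item 2 is disjoint from them.
X = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 0],
])
scores = bayesian_sets_scores(X, query_idx=[0])
ranking = np.argsort(-scores)  # item indices, best match first
```

Sorting the scores in descending order gives the set completion; in this toy example the item sharing all of the query's features ranks above the item sharing one, which in turn ranks above the item sharing none.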

Katherine A. Heller and Zoubin Ghahramani.
**A simple Bayesian
framework for content-based image retrieval**.
In *CVPR*, pages 2110-2117. IEEE Computer Society, 2006.

** Abstract:** We present a Bayesian framework for content-based
image retrieval which models the distribution of color and texture features
within sets of related images. Given a user-specified text query (e.g.
"penguins") the system first extracts a set of images, from a labelled
corpus, corresponding to that query. The distribution over features of these
images is used to compute a Bayesian score for each image in a large
unlabelled corpus. Unlabelled images are then ranked using this score and the
top images are returned. Although the Bayesian score is based on computing
marginal likelihoods, which integrate over model parameters, in the case of
sparse binary data the score reduces to a single matrix-vector multiplication
and is therefore extremely efficient to compute. We show that our method
works surprisingly well despite its simplicity and the fact that no relevance
feedback is used. We compare different choices of features, and evaluate our
results using human subjects.

Katherine A. Heller and Zoubin Ghahramani.
**Bayesian
hierarchical clustering**.
In Luc De Raedt and Stefan Wrobel, editors, *ICML*, volume 119 of
*ACM International Conference Proceeding Series*, pages 297-304.
Association for Computing Machinery, 2005.

** Abstract:** We
present a novel algorithm for agglomerative hierarchical clustering based on
evaluating marginal likelihoods of a probabilistic model. This algorithm has
several advantages over traditional distance-based agglomerative clustering
algorithms. (1) It defines a probabilistic model of the data which can be
used to compute the predictive distribution of a test point and the
probability of it belonging to any of the existing clusters in the tree. (2)
It uses a model-based criterion to decide on merging clusters rather than an
ad-hoc distance metric. (3) Bayesian hypothesis testing is used to decide
which merges are advantageous and to output the recommended depth of the
tree. (4) The algorithm can be interpreted as a novel fast bottom-up
approximate inference method for a Dirichlet process (i.e. countably
infinite) mixture model (DPM). It provides a new lower bound on the marginal
likelihood of a DPM by summing over exponentially many clusterings of the
data in polynomial time. We describe procedures for learning the model
hyperparameters, computing the predictive distribution, and extensions to
the algorithm. Experimental results on synthetic and real-world data sets
demonstrate useful properties of the algorithm.

Ross M. Clarke, Elre T. Oldewage, and José Miguel Hernández-Lobato.
**Scalable one-pass
optimisation of high-dimensional weight-update hyperparameters by implicit
differentiation**.
In *10th International Conference on Learning Representations*, Virtual,
April 2022.

** Abstract:** Machine learning training methods
depend plentifully and intricately on hyperparameters, motivating automated
strategies for their optimisation. Many existing algorithms restart training
for each new hyperparameter choice, at considerable computational cost. Some
hypergradient-based one-pass methods exist, but these either cannot be
applied to arbitrary optimiser hyperparameters (such as learning rates and
momenta) or take several times longer to train than their base models. We
extend these existing methods to develop an approximate hypergradient-based
hyperparameter optimiser which is applicable to any continuous hyperparameter
appearing in a differentiable model weight update, yet requires only one
training episode, with no restarts. We also provide a motivating argument for
convergence to the true hypergradient, and perform tractable gradient-based
optimisation of independent learning rates for each model parameter. Our
method performs competitively from varied random hyperparameter
initialisations on several UCI datasets and Fashion-MNIST (using a one-layer
MLP), Penn Treebank (using an LSTM) and CIFAR-10 (using a ResNet-18), in time
only 2-3x greater than vanilla training.

**Adapting the
linearised Laplace model evidence for modern deep learning**.
In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang
Niu, and Sivan Sabato, editors, *39th International Conference on Machine
Learning*, volume 162 of *Proceedings of Machine Learning
Research*, pages 796-821. PMLR, 2022.

** Abstract:**
The linearised Laplace method for estimating model uncertainty has received
renewed attention in the Bayesian deep learning community. The method
provides reliable error bars and admits a closed-form expression for the
model evidence, allowing for scalable selection of model hyperparameters. In
this work, we examine the assumptions behind this method, particularly in
conjunction with model selection. We show that these interact poorly with
some now-standard tools of deep learning (stochastic approximation methods
and normalisation layers) and make recommendations for how to better adapt
this classic method to the modern setting. We provide theoretical support for
our recommendations and validate them empirically on MLPs, classic CNNs,
residual networks with and without normalisation layers, generative
autoencoders and transformers.

Wenlin Chen, Austin Tripp, and José Miguel Hernández-Lobato.
**Meta-learning adaptive deep
kernel Gaussian processes for molecular property prediction**.
*arXiv*, 2022.

** Abstract:** We propose Adaptive Deep
Kernel Fitting with Implicit Function Theorem (ADKF-IFT), a novel framework
for learning deep kernel Gaussian processes (GPs) by interpolating between
meta-learning and conventional deep kernel learning. Our approach employs a
bilevel optimization objective where we meta-learn generally useful feature
representations across tasks, in the sense that task-specific GP models
estimated on top of such features achieve the lowest possible predictive loss
on average. We solve the resulting nested optimization problem using the
implicit function theorem (IFT). We show that our ADKF-IFT framework contains
previously proposed Deep Kernel Learning (DKL) and Deep Kernel Transfer (DKT)
as special cases. Although ADKF-IFT is a completely general method, we argue
that it is especially well-suited for drug discovery problems and demonstrate
that it significantly outperforms previous state-of-the-art methods on a
variety of real-world few-shot molecular property prediction tasks and
out-of-domain molecular property prediction and optimization tasks.

Gergely Flamich, Stratis Markou, and José Miguel Hernández-Lobato.
**Fast relative entropy coding with
A* coding**.
In *39th International Conference on Machine Learning*, 2022.

** Abstract:** Relative entropy coding (REC) algorithms encode a
sample from a target distribution $Q$ using a proposal distribution $P$, such
that the expected codelength is $\mathcal{O}(D_{KL}[Q \| P])$. REC can be
seamlessly integrated with existing learned compression models since, unlike
entropy coding, it does not assume discrete $Q$ or $P$, and does not require
quantisation. However, general REC algorithms require an intractable
$\Omega(e^{D_{KL}[Q \| P]})$ runtime. We introduce AS* and AD* coding, two REC
algorithms based on A* sampling. We prove that, for continuous distributions
over $\mathbb{R}$, if the density ratio is unimodal, AS* has
$\mathcal{O}(D_{\infty}[Q \| P])$ expected runtime, where
$D_{\infty}[Q \| P]$ is the Rényi $\infty$-divergence. We provide
experimental evidence that AD* also has $\mathcal{O}(D_{\infty}[Q \| P])$
expected runtime. We prove that AS* and AD* achieve an expected codelength of
$\mathcal{O}(D_{KL}[Q \| P])$. Further, we introduce DAD*, an approximate
algorithm based on AD* which retains its favourable runtime and has bias
similar to that of alternative methods. Focusing on VAEs, we propose the
IsoKL VAE (IKVAE), which can be used with DAD* to further improve compression
efficiency. We evaluate A* coding with (IK)VAEs on MNIST, showing that it can
losslessly compress images near the theoretically optimal limit.

Austin Tripp, Wenlin Chen, and José Miguel Hernández-Lobato.
**An evaluation framework
for the objective functions of de novo drug design benchmarks**.
In *ICLR 2022 Workshop on Machine Learning for Drug Discovery*, 2022.

** Abstract:** De novo drug design has recently received
increasing attention from the machine learning community. It is important
that the field is aware of the actual goals and challenges of drug design and
the roles that de novo molecule design algorithms could play in accelerating
the process, so that algorithms can be evaluated in a way that reflects how
they would be applied in real drug design scenarios. In this paper, we
propose a framework for critically assessing the merits of benchmarks, and
argue that most of the existing de novo drug design benchmark functions are
either highly unrealistic or depend upon a surrogate model whose performance
is not well characterized. In order for the field to achieve its long-term
goals, we recommend that poor benchmarks (especially logP and QED) be
deprecated in favour of better benchmarks. We hope that our proposed
framework can play a part in developing new de novo drug design benchmarks
that are more realistic and ideally incorporate the intrinsic goals of drug
design.

Javier Antorán, Umang Bhatt, Tameem Adel, Adrian Weller, and José Miguel
Hernández-Lobato.
**Getting a CLUE: A
method for explaining uncertainty estimates**.
In *9th International Conference on Learning Representations*, April
2021.

** Abstract:** Both uncertainty estimation and
interpretability are important factors for trustworthy machine learning
systems. However, there is little work at the intersection of these two
areas. We address this gap by proposing a novel method for interpreting
uncertainty estimates from differentiable probabilistic models, like Bayesian
Neural Networks (BNNs). Our method, Counterfactual Latent Uncertainty
Explanations (CLUE), indicates how to change an input, while keeping it on
the data manifold, such that a BNN becomes more confident about the input's
prediction. We validate CLUE through 1) a novel framework for evaluating
counterfactual explanations of uncertainty, 2) a series of ablation
experiments, and 3) a user study. Our experiments show that CLUE outperforms
baselines and enables practitioners to better understand which input patterns
are responsible for predictive uncertainty.

**Bayesian deep
learning via subnetwork inference**.
In Marina Meila and Tong Zhang, editors, *38th International Conference on
Machine Learning*, volume 139 of *Proceedings of Machine Learning
Research*, pages 2510-2521. PMLR, 2021.

** Abstract:**
The Bayesian paradigm has the potential to solve core issues of deep neural
networks such as poor calibration and data inefficiency. Alas, scaling
Bayesian inference to large weight spaces often requires restrictive
approximations. In this work, we show that it suffices to perform inference
over a small subset of model weights in order to obtain accurate predictive
posteriors. The other weights are kept as point estimates. This subnetwork
inference framework enables us to use expressive, otherwise intractable,
posterior approximations over such subsets. In particular, we implement
subnetwork linearized Laplace: We first obtain a MAP estimate of all weights
and then infer a full-covariance Gaussian posterior over a subnetwork. We
propose a subnetwork selection strategy that aims to maximally preserve the
model's predictive uncertainty. Empirically, our approach is effective
compared to ensembles and less expressive posterior approximations over full
networks.

Chelsea Murray, James Urquhart Allingham, Javier Antorán, and José Miguel
Hernández-Lobato.
**Addressing bias
in active learning with depth uncertainty networks... or not**.
In Melanie F. Pradier, Aaron Schein, Stephanie L. Hyland, Francisco J. R. Ruiz,
and Jessica Zosa Forde, editors, *I (Still) Can't Believe It's Not Better!
Workshop at NeurIPS 2021, Virtual Workshop, December 13, 2021*, volume
163 of *Proceedings of Machine Learning Research*, pages 59-63.
PMLR, 2021.

** Abstract:** Farquhar et al. [2021] show that
correcting for active learning bias with underparameterised models leads to
improved downstream performance. For overparameterised models such as NNs,
however, correction leads either to decreased or unchanged performance. They
suggest that this is due to an "overfitting bias" which offsets the active
learning bias. We show that depth uncertainty networks operate in a low
overfitting regime, much like underparameterised models. They should
therefore see an increase in performance with bias correction. Surprisingly,
they do not. We propose that this negative result, as well as the results of
Farquhar et al. [2021], can be explained via the lens of the bias-variance
decomposition of generalisation error.

Javier Antorán, James Urquhart Allingham, and José Miguel
Hernández-Lobato.
**Depth
uncertainty in neural networks**.
In Hugo Larochelle, Marc'Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan,
and Hsuan-Tien Lin, editors, *Advances in Neural Information Processing
Systems 33*, 2020.

** Abstract:** Existing methods for
estimating uncertainty in deep learning tend to require multiple forward
passes, making them unsuitable for applications where computational resources
are limited. To solve this, we perform probabilistic reasoning over the depth
of neural networks. Different depths correspond to subnetworks which share
weights and whose predictions are combined via marginalisation, yielding
model uncertainty. By exploiting the sequential structure of feed-forward
networks, we are able to both evaluate our training objective and make
predictions with a single forward pass. We validate our approach on
real-world regression and image classification tasks. Our approach provides
uncertainty calibration, robustness to dataset shift, and accuracies
competitive with more computationally expensive baselines.

** Comment:** Code

Eric Nalisnick, José Miguel Hernández-Lobato, and Padhraic Smyth.
**Dropout as a
structured shrinkage prior**.
In *36th International Conference on Machine Learning*, Long Beach, June
2019.

** Abstract:** Dropout regularization of deep neural
networks has been a mysterious yet effective tool to prevent overfitting.
Explanations for its success range from the prevention of co-adapted weights
to it being a form of cheap Bayesian inference. We propose a novel framework
for understanding multiplicative noise in neural networks, considering
continuous distributions as well as Bernoulli noise (i.e. dropout). We show
that multiplicative noise induces structured shrinkage priors on a network's
weights. We derive the equivalence through reparametrization properties of
scale mixtures and without invoking any approximations. Given the
equivalence, we then show that dropout's Monte Carlo training objective
approximates marginal MAP estimation. We leverage these insights to propose a
novel shrinkage framework for resnets, terming the prior 'automatic depth
determination' as it is the natural analog of automatic relevance
determination for network depth. Lastly, we investigate two inference
strategies that improve upon the aforementioned MAP approximation in
regression benchmarks.

David Janz, Jiri Hron, Przemyslaw Mazur, José Miguel Hernández-Lobato,
Katja Hofmann, and Sebastian Tschiatschek.
**Successor Uncertainties:
exploration and uncertainty in temporal difference learning**.
*NeurIPS*, 2019.

** Abstract:** Posterior sampling for
reinforcement learning (PSRL) is an effective method for balancing
exploration and exploitation in reinforcement learning. Randomised value
functions (RVF) can be viewed as a promising approach to scaling PSRL.
However, we show that most contemporary algorithms combining RVF with neural
network function approximation do not possess the properties which make PSRL
effective, and provably fail in sparse reward problems. Moreover, we find
that propagation of uncertainty, a property of PSRL previously thought
important for exploration, does not preclude this failure. We use these
insights to design Successor Uncertainties (SU), a cheap and easy to
implement RVF algorithm that retains key properties of PSRL. SU is highly
effective on hard tabular exploration benchmarks. Furthermore, on the Atari
2600 domain, it surpasses human performance on 38 of 49 games tested
(achieving a median human normalised score of 2.09), and outperforms its
closest RVF competitor, Bootstrapped DQN, on 36 of those.

Robert Pinsler, Jonathan Gordon, Eric Nalisnick, and José Miguel
Hernández-Lobato.
**Bayesian
batch active learning as sparse subset approximation**.
In *Advances in Neural Information Processing Systems 32*, 2019.

** Abstract:** Leveraging the wealth of unlabeled data produced
in recent years provides great potential for improving supervised models.
When the cost of acquiring labels is high, probabilistic active learning
methods can be used to greedily select the most informative data points to be
labeled. However, for many large-scale problems standard greedy procedures
become computationally infeasible and suffer from negligible model change. In
this paper, we introduce a novel Bayesian batch active learning approach that
mitigates these issues. Our approach is motivated by approximating the
complete data posterior of the model parameters. While naive batch
construction methods result in correlated queries, our algorithm produces
diverse batches that enable efficient active learning at scale. We derive
interpretable closed-form solutions akin to existing active learning
procedures for linear models, and generalize to arbitrary models using random
projections. We demonstrate the benefits of our approach on several
large-scale regression and classification tasks.

Natasha Jaques, Shixiang Gu, Dzmitry Bahdanau, José Miguel Hernández-Lobato,
Richard E. Turner, and Douglas Eck.
**Sequence tutor: Conservative
fine-tuning of sequence generation models with KL-control**.
In *34th International Conference on Machine Learning*, Sydney,
Australia, August 2017.

** Abstract:** This paper proposes a
general method for improving the structure and quality of sequences generated
by a recurrent neural network (RNN), while maintaining information originally
learned from data, as well as sample diversity. An RNN is first pre-trained
on data using maximum likelihood estimation (MLE), and the probability
distribution over the next token in the sequence learned by this model is
treated as a prior policy. Another RNN is then trained using reinforcement
learning (RL) to generate higher-quality outputs that account for
domain-specific incentives while retaining proximity to the prior policy of
the MLE RNN. To formalize this objective, we derive novel off-policy RL
methods for RNNs from KL-control. The effectiveness of the approach is
demonstrated on two applications: 1) generating novel musical melodies, and
2) computational molecular generation. For both problems, we show that the
proposed method improves the desired properties and structure of the
generated sequences, while maintaining information learned from data.

** Comment:** [MIT
Technology Review] [Video]

Thang D. Bui, Daniel Hernández-Lobato, José Miguel Hernández-Lobato,
Yingzhen Li, and Richard E. Turner.
**Deep Gaussian
processes for regression using approximate expectation propagation**.
In *33rd International Conference on Machine Learning*, New York, USA,
June 2016.

** Abstract:** Deep Gaussian processes (DGPs) are
multi-layer hierarchical generalisations of Gaussian processes (GPs) and are
formally equivalent to neural networks with multiple, infinitely wide hidden
layers. DGPs are nonparametric probabilistic models and as such are arguably
more flexible, have a greater capacity to generalise, and provide better
calibrated uncertainty estimates than alternative deep models. This paper
develops a new approximate Bayesian learning scheme that enables DGPs to be
applied to a range of medium to large scale regression problems for the first
time. The new method uses an approximate Expectation Propagation procedure
and a novel and efficient extension of the probabilistic backpropagation
algorithm for learning. We evaluate the new method for non-linear regression
on eleven real-world datasets, showing that it always outperforms GP
regression and is almost always better than state-of-the-art deterministic
and sampling-based approximate inference methods for Bayesian neural
networks. As a by-product, this work provides a comprehensive analysis of six
approximate Bayesian methods for training neural networks.

José Miguel Hernández-Lobato, Yingzhen Li, Mark Rowland, Thang D. Bui,
Daniel Hernández-Lobato, and Richard E. Turner.
**Black-box
Alpha Divergence Minimization**.
In *33rd International Conference on Machine Learning*, New York, USA,
June 2016.

** Abstract:** Black-box alpha (BB-$α$) is a
new approximate inference method based on the minimization of
$α$-divergences. BB-$α$ scales to large datasets because it can be
implemented using stochastic gradient descent. BB-$α$ can be applied to
complex probabilistic models with little effort since it only requires as
input the likelihood function and its gradients. These gradients can be
easily obtained using automatic differentiation. By changing the divergence
parameter $α$, the method is able to interpolate between variational Bayes
(VB) ($α \rightarrow 0$) and an algorithm similar to expectation
propagation (EP) ($α = 1$). Experiments on probit regression and neural
network regression and classification problems show that BB-$α$ with
non-standard settings of $α$, such as $α = 0.5$, usually produces
better predictions than with $α \rightarrow 0$ (VB) or $α = 1$
(EP).

Yingzhen Li, José Miguel Hernández-Lobato, and Richard E. Turner.
**Stochastic Expectation
Propagation**.
In *Advances in Neural Information Processing Systems 28*, Montréal,
Canada, December 2015.

** Abstract:** Expectation propagation (EP)
is a deterministic approximation algorithm that is often used to perform
approximate Bayesian parameter learning. EP approximates the full intractable
posterior distribution through a set of local approximations that are
iteratively refined for each datapoint. EP can offer analytic and
computational advantages over other approximations, such as Variational
Inference (VI), and is the method of choice for a number of models. The local
nature of EP appears to make it an ideal candidate for performing Bayesian
learning on large models in large-scale dataset settings. However, EP has a
crucial limitation in this context: the number of approximating factors needs to
increase with the number of data-points, N, which often entails a
prohibitively large memory overhead. This paper presents an extension to EP,
called stochastic expectation propagation (SEP), that maintains a global
posterior approximation (like VI) but updates it in a local way (like EP).
Experiments on a number of canonical learning problems using synthetic and
real-world datasets indicate that SEP performs almost as well as full EP, but
reduces the memory consumption by a factor of N. SEP is therefore ideally
suited to performing approximate Bayesian learning in the large model, large
dataset setting.

José Miguel Hernández-Lobato, Michael A. Gelbart, Matthew W. Hoffman,
Ryan P. Adams, and Zoubin Ghahramani.
**Predictive
entropy search for Bayesian optimization with unknown constraints**.
In *32nd International Conference on Machine Learning*, pages
1699-1707, 2015.

** Abstract:** Unknown constraints arise in
many types of expensive black-box optimization problems. Several methods have
been proposed recently for performing Bayesian optimization with constraints,
based on the expected improvement (EI) heuristic. However, EI can lead to
pathologies when used with constraints. For example, in the case of decoupled
constraints—i.e., when one can independently evaluate the objective or the
constraints—EI can encounter a pathology that prevents exploration.
Additionally, computing EI requires a current best solution, which may not
exist if none of the data collected so far satisfy the constraints. By
contrast, information-based approaches do not suffer from these failure
modes. In this paper, we present a new information-based method called
Predictive Entropy Search with Constraints (PESC). We analyze the performance
of PESC and show that it compares favorably to EI-based approaches on
synthetic and benchmark problems, as well as several real-world examples. We
demonstrate that PESC is an effective algorithm that provides a promising
direction towards a unified solution for constrained Bayesian
optimization.

José Miguel Hernández-Lobato, Matthew W. Hoffman, and Zoubin Ghahramani.
**Predictive
entropy search for efficient global optimization of black-box
functions**.
In *Advances in Neural Information Processing Systems 28*, pages 1-9,
Montreal, Canada, December 2014.

** Abstract:** We propose a
novel information-theoretic approach for Bayesian optimization called
Predictive Entropy Search (PES). At each iteration, PES selects the next
evaluation point that maximizes the expected information gained with respect
to the global maximum. PES codifies this intractable acquisition function in
terms of the expected reduction in the differential entropy of the predictive
distribution. This reformulation allows PES to obtain approximations that are
both more accurate and efficient than other alternatives such as Entropy
Search (ES). Furthermore, PES can easily perform a fully Bayesian treatment
of the model hyperparameters while ES cannot. We evaluate PES in both
synthetic and real-world applications, including optimization problems in
machine learning, finance, biotechnology, and robotics. We show that the
increased accuracy of PES leads to significant gains in optimization
performance.

Yue Wu, José Miguel Hernández-Lobato, and Zoubin Ghahramani.
**Gaussian
process volatility model**.
In *Advances in Neural Information Processing Systems 28*, pages 1-9,
Montreal, Canada, December 2014.

** Abstract:** The prediction
of time-changing variances is an important task in the modeling of financial
data. Standard econometric models are often limited as they assume rigid
functional relationships for the evolution of the variance. Moreover,
functional parameters are usually learned by maximum likelihood, which can
lead to overfitting. To address these problems we introduce GP-Vol, a novel
non-parametric model for time-changing variances based on Gaussian Processes.
This new model can capture highly flexible functional relationships for the
variances. Furthermore, we introduce a new online algorithm for fast
inference in GP-Vol. This method is much faster than current offline
inference procedures and it avoids overfitting problems by following a fully
Bayesian approach. Experiments with financial data show that GP-Vol performs
significantly better than current standard alternatives.

Neil Houlsby, José Miguel Hernández-Lobato, and Zoubin Ghahramani.
**Cold-start
active learning with robust ordinal matrix factorization**.
In *31st International Conference on Machine Learning*, Beijing, China,
June 2014.

** Abstract:** We present a new matrix
factorization model for rating data and a corresponding active learning
strategy to address the cold-start problem. Cold-start is one of the most
challenging tasks for recommender systems: what to recommend with new users
or items for which one has little or no data. An approach is to use active
learning to collect the most useful initial ratings. However, the performance
of active learning depends strongly upon having accurate estimates of i) the
uncertainty in model parameters and ii) the intrinsic noisiness of the data.
To achieve these estimates we propose a heteroskedastic Bayesian model for
ordinal matrix factorization. We also present a computationally efficient
framework for Bayesian active learning with this type of complex
probabilistic model. This algorithm successfully distinguishes between
informative and noisy data points. Our model yields state-of-the-art
predictive performance and, coupled with our active learning strategy,
enables us to gain useful information in the cold-start setting from the very
first active sample.

José Miguel Hernández-Lobato, Neil Houlsby, and Zoubin Ghahramani.
**Probabilistic
matrix factorization with non-random missing data**.
In *31st International Conference on Machine Learning*, Beijing, China,
June 2014.

** Abstract:** We propose a probabilistic matrix
factorization model for collaborative filtering that learns from data that is
missing not at random (MNAR). Matrix factorization models exhibit
state-of-the-art predictive performance in collaborative filtering. However,
these models usually assume that the data is missing at random (MAR), and
this is rarely the case. For example, the data is not MAR if users rate items
they like more than ones they dislike. When the MAR assumption is incorrect,
inferences are biased and predictive performance can suffer. Therefore, we
model both the generative process for the data and the missing data
mechanism. By learning these two models jointly we obtain improved
performance over state-of-the-art methods when predicting the ratings and
when modeling the data observation process. We present the first viable MF
model for MNAR data. Our results are promising and we expect that further
research on MNAR models will yield large gains in collaborative
filtering.

José Miguel Hernández-Lobato, Neil Houlsby, and Zoubin Ghahramani.
**Stochastic
inference for scalable probabilistic modeling of binary matrices**.
In *31st International Conference on Machine Learning*, Beijing, China,
June 2014.

** Abstract:** Fully observed large binary matrices
appear in a wide variety of contexts. To model them, probabilistic matrix
factorization (PMF) methods are an attractive solution. However, current
batch algorithms for PMF can be inefficient because they need to analyze the
entire data matrix before producing any parameter updates. We derive an
efficient stochastic inference algorithm for PMF models of fully observed
binary matrices. Our method exhibits faster convergence rates than more
expensive batch approaches and has better predictive performance than
scalable alternatives. The proposed method includes new data subsampling
strategies which produce large gains over standard uniform subsampling. We
also address the task of automatically selecting the size of the minibatches
of data used by our method. For this, we derive an algorithm that adjusts
this hyper-parameter online.

Daniel Hernández-Lobato and José Miguel Hernández-Lobato.
**Learning
feature selection dependencies in multi-task learning**.
In *Advances in Neural Information Processing Systems 27*, pages 1-9,
Lake Tahoe, California, USA, December 2013.

** Abstract:** A
probabilistic model based on the horseshoe prior is proposed for learning
dependencies in the process of identifying relevant features for prediction.
Exact inference is intractable in this model. However, expectation
propagation offers an approximate alternative. Because the process of
estimating feature selection dependencies may suffer from over-fitting in the
model proposed, additional data from a multi-task learning scenario are
considered for induction. The same model can be used in this setting with few
modifications. Furthermore, the assumptions made are less restrictive than in
other multi-task methods: The different tasks must share feature selection
dependencies, but can have different relevant features and model
coefficients. Experiments with real and synthetic data show that this model
performs better than other multi-task alternatives from the literature. The
experiments also show that the model is able to induce suitable feature
selection dependencies for the problems considered, only from the training
data.

José Miguel Hernández-Lobato, James Robert Lloyd, and Daniel
Hernández-Lobato.
**Gaussian process
conditional copulas with applications to financial time series**.
In *Advances in Neural Information Processing Systems 27*, pages 1-9,
Lake Tahoe, California, USA, December 2013.

** Abstract:** The
estimation of dependencies between multiple variables is a central problem in
the analysis of financial time series. A common approach is to express these
dependencies in terms of a copula function. Typically the copula function is
assumed to be constant but this may be inaccurate when there are covariates
that could have a large influence on the dependence structure of the data. To
account for this, a Bayesian framework for the estimation of conditional
copulas is proposed. In this framework the parameters of a copula are
non-linearly related to some arbitrary conditioning variables. We evaluate
the ability of our method to predict time-varying dependencies on several
equities and currencies and observe consistent performance gains compared to
static copula models and other time-varying copula methods.

Daniel Hernández-Lobato, José Miguel Hernández-Lobato, and Pierre Dupont.
**Generalized
spike-and-slab priors for Bayesian group feature selection using
expectation propagation**.
*Journal of Machine Learning Research*, 14:1891-1945, July 2013.

** Abstract:** We describe a Bayesian method for group feature
selection in linear regression problems. The method is based on a generalized
version of the standard spike-and-slab prior distribution which is often used
for individual feature selection. Exact Bayesian inference under the prior
considered is infeasible for typical regression problems. However,
approximate inference can be carried out efficiently using Expectation
Propagation (EP). A detailed analysis of the generalized spike-and-slab
prior shows that it is well suited for regression problems that are sparse at
the group level. Furthermore, this prior can be used to introduce prior
knowledge about specific groups of features that are a priori believed to be
more relevant. An experimental evaluation compares the performance of the
proposed method with those of group LASSO, Bayesian group LASSO,
automatic relevance determination and additional variants used for group
feature selection. The results of these experiments show that a model based
on the generalized spike-and-slab prior and the EP algorithm has
state-of-the-art prediction performance in the problems analyzed.
Furthermore, this model is also very useful to carry out sequential
experimental design (also known as active learning), where the data instances
that are most informative are iteratively included in the training set,
reducing the number of instances needed to obtain a particular level of
prediction accuracy.

David Lopez-Paz, José Miguel Hernández-Lobato, and Zoubin Ghahramani.
**Gaussian
process vine copulas for multivariate dependence**.
In *30th International Conference on Machine Learning*, Atlanta,
Georgia, USA, June 2013.

** Abstract:** Copulas allow one to learn
marginal distributions separately from the multivariate dependence structure
(copula) that links them together into a density function. Vine
factorizations ease the learning of high-dimensional copulas by constructing
a hierarchy of conditional bivariate copulas. However, to simplify inference,
it is common to assume that each of these conditional bivariate copulas is
independent from its conditioning variables. In this paper, we relax this
assumption by discovering the latent functions that specify the shape of a
conditional copula given its conditioning variables. We learn these functions
by following a Bayesian approach based on sparse Gaussian processes with
expectation propagation for scalable, approximate inference. Experiments on
real-world datasets show that, when modeling all conditional dependencies, we
obtain better estimates of the underlying copula of the data.

Yue Wu, José Miguel Hernández-Lobato, and Zoubin Ghahramani.
**Dynamic covariance
models for multivariate financial time series**.
In *30th International Conference on Machine Learning*, Atlanta,
Georgia, USA, June 2013.

** Abstract:** The accurate
prediction of time-changing covariances is an important problem in the
modeling of multivariate financial data. However, some of the most popular
models suffer from a) overfitting problems and multiple local optima, b)
failure to capture shifts in market conditions and c) large computational
costs. To address these problems we introduce a novel dynamic model for
time-changing covariances. Over-fitting and local optima are avoided by
following a Bayesian approach instead of computing point estimates. Changes
in market conditions are captured by assuming a diffusion process in
parameter values, and finally computationally efficient and scalable
inference is performed using particle filters. Experiments with financial
data show excellent performance of the proposed method with respect to
current standard models.

José Miguel Hernández-Lobato, Neil Houlsby, and Zoubin Ghahramani.
**Stochastic
inference for scalable probabilistic modeling of binary matrices**.
In *NIPS Workshop on Randomized Methods for Machine Learning*, 2013.

** Abstract:** Fully observed large binary matrices appear in a
wide variety of contexts. To model them, probabilistic matrix factorization
(PMF) methods are an attractive solution. However, current batch algorithms
for PMF can be inefficient since they need to analyze the entire data matrix
before producing any parameter updates. We derive an efficient stochastic
inference algorithm for PMF models of fully observed binary matrices. Our
method exhibits faster convergence rates than more expensive batch approaches
and has better predictive performance than scalable alternatives. The
proposed method includes new data subsampling strategies which produce large
gains over standard uniform subsampling. We also address the task of
automatically selecting the size of the minibatches of data and we propose an
algorithm that adjusts this hyper-parameter in an online manner.

Michael Kaschesky, Pawel Sobkowicz, José Miguel Hernández-Lobato, Guillaume
Bouchard, Cedric Archambeau, Nicolas Scharioth, Robert Manchin, Adrian
Gschwend, and Reinhard Riedl.
**Bringing
representativeness into social media monitoring and analysis**.
In *46th Hawaii International Conference on System Sciences*, Manoa,
Hawaii, 2013.

** Abstract:** The opinions, expectations and
behavior of citizens are increasingly reflected online - therefore, mining
the internet for such data can enhance decision-making in public policy,
communications, marketing, finance and other fields. However, to come closer
to the representativeness of classic opinion surveys there is a lack of
knowledge about the sociodemographic characteristics of those voicing
opinions on the internet. This paper proposes to calibrate online opinions
aggregated from multiple and heterogeneous data sources with traditional
surveys enhanced with rich socio-demographic information to enable insights
into which opinions are expressed on the internet by specific segments of
society. The goal of this research is to provide professionals in citizen-
and consumer-centered domains with more concise near real-time intelligence
on online opinions. To become effective, the methodologies presented in this
paper must be integrated into a coherent decision support system.

David Lopez-Paz, José Miguel Hernández-Lobato, and Bernhard Schölkopf.
**Semi-supervised
domain adaptation with non-parametric copulas**.
In *Advances in Neural Information Processing Systems 26*, pages 1-9,
Lake Tahoe, California, USA, December 2012.

** Abstract:** A
new framework based on the theory of copulas is proposed to address
semi-supervised domain adaptation problems. The presented method factorizes
any multivariate density into a product of marginal distributions and
bivariate copula functions. Therefore, changes in each of these factors can
be detected and corrected to adapt a density model across different learning
domains. Importantly, we introduce a novel vine copula model, which allows
for this factorization in a non-parametric manner. Experimental results on
regression problems with real-world data illustrate the efficacy of the
proposed approach when compared to state-of-the-art techniques.

Neil Houlsby, José Miguel Hernández-Lobato, Ferenc Huszár, and Zoubin
Ghahramani.
**Collaborative
Gaussian processes for preference learning**.
In *Advances in Neural Information Processing Systems 26*, pages
2096-2104. Curran Associates, Inc., 2012.

** Abstract:** We
present a new model based on Gaussian processes (GPs) for learning pairwise
preferences expressed by multiple users. Inference is simplified by using a
*preference kernel* for GPs which allows us to combine supervised GP
learning of user preferences with unsupervised dimensionality reduction for
multi-user systems. The model not only exploits collaborative information
from the shared structure in user behavior, but may also incorporate user
features if they are available. Approximate inference is implemented using a
combination of expectation propagation and variational Bayes. Finally, we
present an efficient active learning strategy for querying preferences. The
proposed technique performs favorably on real-world data against
state-of-the-art multi-user preference learning algorithms.

Daniel Hernández-Lobato, José Miguel Hernández-Lobato, and Pierre Dupont.
**Robust
multi-class Gaussian process classification**.
In *Advances in Neural Information Processing Systems 25*, 2011.

** Abstract:** Multi-class Gaussian Process Classifiers (MGPCs)
are often affected by overfitting problems when labeling errors occur far
from the decision boundaries. To prevent this, we investigate a robust MGPC
(RMGPC) which considers labeling errors independently of their distance to
the decision boundaries. Expectation propagation is used for approximate
inference. Experiments with several datasets in which noise is injected in
the labels illustrate the benefits of RMGPC. This method performs better than
other Gaussian process alternatives based on considering latent Gaussian
noise or heavy-tailed processes. When no noise is injected in the labels,
RMGPC still performs as well as or better than the other methods. Finally, we show
how RMGPC can be used to successfully identify data instances which are
difficult to classify correctly in practice.

Matthew W Hoffman, Bobak Shahriari, and Nando de Freitas.
**On
correlation and budget constraints in model-based bandit optimization with
application to automatic machine learning**.
In *17th International Conference on Artificial Intelligence and
Statistics*, pages 365-374, Reykjavik, Iceland, April 2014.

** Abstract:** We address the problem of finding the maximizer
of a nonlinear function that can only be evaluated, subject to noise, at a
finite number of query locations. Further, we will assume that there is a
constraint on the total number of permitted function evaluations. We
introduce a Bayesian approach for this problem and show that it empirically
outperforms both the existing frequentist counterpart and other Bayesian
optimization methods. The Bayesian approach places emphasis on detailed
modelling, including the modelling of correlations among the arms. As a
result, it can perform well in situations where the number of arms is much
larger than the number of allowed function evaluations, whereas the
frequentist counterpart is inapplicable. This feature enables us to develop
and deploy practical applications, such as automatic machine learning
toolboxes. The paper presents comprehensive comparisons of the proposed
approach with many Bayesian and bandit optimization techniques, the first
comparison of many of these methods in the literature.

James Urquhart Allingham, Florian Wenzel, Zelda E Mariet, Basil Mustafa, Joan
Puigcerver, Neil Houlsby, Ghassen Jerfel, Vincent Fortuin, Balaji
Lakshminarayanan, Jasper Snoek, Dustin Tran, Carlos Riquelme Ruiz, and
Rodolphe Jenatton.
**Sparse MoEs meet
efficient ensembles**.
*Transactions on Machine Learning Research*, 2022.

**
Abstract:** Machine learning models based on the aggregated outputs of
submodels, either at the activation or prediction levels, often exhibit
strong performance compared to individual models. We study the interplay of
two popular classes of such models: ensembles of neural networks and sparse
mixture of experts (sparse MoEs). First, we show that the two approaches have
complementary features whose combination is beneficial. This includes a
comprehensive evaluation of sparse MoEs in uncertainty related benchmarks.
Then, we present efficient ensemble of experts ($E^{3}$), a scalable
and simple ensemble of sparse MoEs that takes the best of both classes of
models, while using up to 45% fewer FLOPs than a deep ensemble. Extensive
experiments demonstrate the accuracy, log-likelihood, few-shot learning,
robustness, and uncertainty improvements of $E^{3}$ over several
challenging vision Transformer-based baselines. $E^{3}$ not only
preserves its efficiency while scaling to models with up to 2.7B parameters,
but also provides better predictive performance and uncertainty estimates for
larger models.

** Comment:** Code

Neil Houlsby, José Miguel Hernández-Lobato, and Zoubin Ghahramani.
**Cold-start
active learning with robust ordinal matrix factorization**.
In *31st International Conference on Machine Learning*, Beijing, China,
June 2014.

** Abstract:** We present a new matrix
factorization model for rating data and a corresponding active learning
strategy to address the cold-start problem. Cold-start is one of the most
challenging tasks for recommender systems: what to recommend with new users
or items for which one has little or no data. An approach is to use active
learning to collect the most useful initial ratings. However, the performance
of active learning depends strongly upon having accurate estimates of i) the
uncertainty in model parameters and ii) the intrinsic noisiness of the data.
To achieve these estimates we propose a heteroskedastic Bayesian model for
ordinal matrix factorization. We also present a computationally efficient
framework for Bayesian active learning with this type of complex
probabilistic model. This algorithm successfully distinguishes between
informative and noisy data points. Our model yields state-of-the-art
predictive performance and, coupled with our active learning strategy,
enables us to gain useful information in the cold-start setting from the very
first active sample.

José Miguel Hernández-Lobato, Neil Houlsby, and Zoubin Ghahramani.
**Probabilistic
matrix factorization with non-random missing data**.
In *31st International Conference on Machine Learning*, Beijing, China,
June 2014.

** Abstract:** We propose a probabilistic matrix
factorization model for collaborative filtering that learns from data that is
missing not at random (MNAR). Matrix factorization models exhibit
state-of-the-art predictive performance in collaborative filtering. However,
these models usually assume that the data is missing at random (MAR), and
this is rarely the case. For example, the data is not MAR if users rate items
they like more than ones they dislike. When the MAR assumption is incorrect,
inferences are biased and predictive performance can suffer. Therefore, we
model both the generative process for the data and the missing data
mechanism. By learning these two models jointly we obtain improved
performance over state-of-the-art methods when predicting the ratings and
when modeling the data observation process. We present the first viable MF
model for MNAR data. Our results are promising and we expect that further
research on MNAR models will yield large gains in collaborative
filtering.

José Miguel Hernández-Lobato, Neil Houlsby, and Zoubin Ghahramani.
**Stochastic
inference for scalable probabilistic modeling of binary matrices**.
In *31st International Conference on Machine Learning*, Beijing, China,
June 2014.

** Abstract:** Fully observed large binary matrices
appear in a wide variety of contexts. To model them, probabilistic matrix
factorization (PMF) methods are an attractive solution. However, current
batch algorithms for PMF can be inefficient because they need to analyze the
entire data matrix before producing any parameter updates. We derive an
efficient stochastic inference algorithm for PMF models of fully observed
binary matrices. Our method exhibits faster convergence rates than more
expensive batch approaches and has better predictive performance than
scalable alternatives. The proposed method includes new data subsampling
strategies which produce large gains over standard uniform subsampling. We
also address the task of automatically selecting the size of the minibatches
of data used by our method. For this, we derive an algorithm that adjusts
this hyper-parameter online.

Neil Houlsby and Massimiliano Ciaramita.
**A
scalable Gibbs sampler for probabilistic entity linking**.
In *36th European Conference on Information Retrieval*, pages 335-346.
Springer, 2014.

** Abstract:** Entity linking involves
labeling phrases in text with their referent entities, such as Wikipedia or
Freebase entries. This task is challenging due to the large number of
possible entities, in the millions, and heavy-tailed mention ambiguity. We
formulate the problem in terms of probabilistic inference within a topic
model, where each topic is associated with a Wikipedia article. To deal with
the large number of topics we propose a novel efficient Gibbs sampling scheme
which can also incorporate side information, such as the Wikipedia graph.
This conceptually simple probabilistic approach achieves state-of-the-art
performance in entity-linking on the Aida-CoNLL dataset.

Neil Houlsby and Guy Houlsby.
**Statistical
fitting of undrained strength data**.
*Geotechnique*, 63(13):1253-1263, 2013, doi
10.1680/geot.13.P.007.

** Abstract:** We describe an
approach, based on Bayesian statistical methods, that allows the fitting of a
design profile to a set of measurements of undrained strengths. In particular
we allow for the automatic determination of not only the positions of
boundaries between geological units, but also the selection of the number of
units to model the data in an appropriate way.

Neil Houlsby, Ferenc Huszár, Mohammad M Ghassemi, Gergő Orbán,
Daniel M Wolpert, and Máté Lengyel.
**Cognitive
tomography reveals complex task-independent mental representations**.
*Current Biology*, 23(21):2169-2175, 2013, doi
10.1016/j.cub.2013.09.012.

** Abstract:** Humans develop
rich mental representations that guide their behavior in a variety of
every-day tasks. However, it is unknown whether these representations, often
formalized as priors in Bayesian inference, are specific for each task or
subserve multiple tasks. Current approaches cannot distinguish between these
two possibilities because they cannot extract comparable representations
across different tasks. Here, we develop a novel method, termed cognitive
tomography, that can extract complex, multi-dimensional priors across tasks.
We apply this method to human judgments in two qualitatively different tasks,
familiarity and odd-one-out, involving an ecologically relevant set of
stimuli, human faces. We show that priors over faces are structurally complex
and vary dramatically across subjects, but are invariant across the tasks
within each subject. The priors we extract from each task allow us to predict
with high precision the behavior of subjects for novel stimuli both in the
same task as well as in the other task. Our results provide the first
evidence for a single high-dimensional structured representation of a
naturalistic stimulus set that guides behavior in multiple tasks. Moreover,
the representations estimated by cognitive tomography can provide
independent, behavior-based regressors for elucidating the neural correlates
of complex naturalistic priors.

José Miguel Hernández-Lobato, Neil Houlsby, and Zoubin Ghahramani.
**Stochastic
inference for scalable probabilistic modeling of binary matrices**.
In *NIPS Workshop on Randomized Methods for Machine Learning*, 2013.

** Abstract:** Fully observed large binary matrices appear in a
wide variety of contexts. To model them, probabilistic matrix factorization
(PMF) methods are an attractive solution. However, current batch algorithms
for PMF can be inefficient since they need to analyze the entire data matrix
before producing any parameter updates. We derive an efficient stochastic
inference algorithm for PMF models of fully observed binary matrices. Our
method exhibits faster convergence rates than more expensive batch approaches
and has better predictive performance than scalable alternatives. The
proposed method includes new data subsampling strategies which produce large
gains over standard uniform subsampling. We also address the task of
automatically selecting the size of the minibatches of data and we propose an
algorithm that adjusts this hyper-parameter in an online manner.

Tomoharu Iwata, Neil Houlsby, and Zoubin Ghahramani.
**Active
learning for interactive visualization**.
In *16th International Conference on Artificial Intelligence and
Statistics*, 2013.

** Abstract:** Many automatic
visualization methods have been proposed. However, a visualization that is
automatically generated might be different to how a user wants to arrange the
objects in visualization space. By allowing users to re-locate objects in the
embedding space of the visualization, they can adjust the visualization to
their preference. We propose an active learning framework for interactive
visualization which selects objects for the user to re-locate so that they
can obtain their desired visualization by re-locating as few objects as possible. The
framework is based on an information theoretic criterion, which favors
objects that reduce the uncertainty of the visualization. We present a
concrete application of the proposed framework to the Laplacian eigenmap
visualization method. We demonstrate experimentally that the proposed
framework yields the desired visualization with fewer user interactions than
existing methods.

Konstantin Kravtsov, Stanislav Straupe, Igor Radchenko, Neil Houlsby, Ferenc
Huszár, and Sergey Kulik.
**Experimental
adaptive Bayesian tomography**.
*Physical Review A*, 87(6):062122, 2013.

** Abstract:**
We report an experimental realization of an adaptive quantum state tomography
protocol. Our method takes advantage of a Bayesian approach to statistical
inference and is naturally tailored for adaptive strategies. For pure states
we observe close to $N^{-1}$ scaling of infidelity with overall number of
registered events, while best non-adaptive protocols allow for $N^{-1/2}$
scaling only. Experiments are performed for polarization qubits, but the
approach is readily adapted to any dimension.

Ferenc Huszár and Neil Houlsby.
**Adaptive Bayesian
quantum tomography**.
*Physical Review A*, 85(5):052120, 2012.

** Abstract:**
In this paper we revisit the problem of optimal design of quantum tomographic
experiments. In contrast to previous approaches where an optimal set of
measurements is decided in advance of the experiment, we allow for
measurements to be adaptively and efficiently re-optimised depending on data
collected so far. We develop an adaptive statistical framework based on
Bayesian inference and Shannon's information, and demonstrate a ten-fold
reduction in the total number of measurements required as compared to
non-adaptive methods, including mutually unbiased bases.

Neil Houlsby, José Miguel Hernández-Lobato, Ferenc Huszár, and Zoubin
Ghahramani.
**Collaborative
Gaussian processes for preference learning**.
In *Advances in Neural Information Processing Systems 26*, pages
2096-2104. Curran Associates, Inc., 2012.

** Abstract:** We
present a new model based on Gaussian processes (GPs) for learning pairwise
preferences expressed by multiple users. Inference is simplified by using a
*preference kernel* for GPs which allows us to combine supervised GP
learning of user preferences with unsupervised dimensionality reduction for
multi-user systems. The model not only exploits collaborative information
from the shared structure in user behavior, but may also incorporate user
features if they are available. Approximate inference is implemented using a
combination of expectation propagation and variational Bayes. Finally, we
present an efficient active learning strategy for querying preferences. The
proposed technique performs favorably on real-world data against
state-of-the-art multi-user preference learning algorithms.

Neil Houlsby, Ferenc Huszár, Zoubin Ghahramani, and Máté Lengyel.
**Bayesian active
learning for classification and preference learning**.
*arXiv*, abs/1112.5745, 2011.

** Abstract:** Information
theoretic active learning has been widely studied for probabilistic models.
For simple regression an optimal myopic policy is easily tractable. However,
for other tasks and with more complex models, such as classification with
nonparametric models, the optimal solution is harder to compute. Current
approaches make approximations to achieve tractability. We propose an
approach that expresses information gain in terms of predictive entropies,
and apply this method to the Gaussian Process Classifier (GPC). Our approach
makes minimal approximations to the full information theoretic objective. Our
experimental performance compares favourably to many popular active learning
algorithms, and has equal or lower computational complexity. We compare well
to decision theoretic approaches also, which are privy to more information
and require much more computational time. Secondly, by developing further a
reformulation of binary preference learning to a classification problem, we
extend our algorithm to Gaussian Process preference learning.
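
The predictive-entropy form of information gain mentioned in this abstract can be illustrated with a small Monte Carlo sketch. It assumes the posterior is represented by a handful of sampled class-1 probabilities for one candidate input; the function names are hypothetical, and this is not the Gaussian Process Classifier approximation developed in the paper.

```python
import math
from typing import Sequence


def binary_entropy(p: float) -> float:
    """Entropy (in nats) of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


def information_gain(sampled_probs: Sequence[float]) -> float:
    """Entropy of the mean prediction minus the mean entropy of each prediction.

    `sampled_probs` holds p(y=1 | x, theta_s) for posterior samples theta_s
    (an assumed representation of the posterior, for illustration only).
    """
    n = len(sampled_probs)
    mean_p = sum(sampled_probs) / n
    expected_entropy = sum(binary_entropy(p) for p in sampled_probs) / n
    return binary_entropy(mean_p) - expected_entropy


# Posterior samples that confidently disagree: labelling x is informative.
disagree = information_gain([0.05, 0.95, 0.05, 0.95])
# Posterior samples that agree the label is noisy: labelling x teaches nothing.
noisy = information_gain([0.5, 0.5, 0.5, 0.5])
```

Both inputs have the same marginal predictive entropy, but only the first scores highly, which is the distinction between model uncertainty and label noise that the criterion is meant to capture.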

Jiri Hron, Karl Krauth, Michael I. Jordan, Niki Kilbertus, and Sarah Dean.
**Modelling content creator
incentives on algorithm-curated platforms**.
*arXiv*, 2022.

** Abstract:** Content creators compete
for user attention. Their reach crucially depends on algorithmic choices made
by developers on online platforms. To maximize exposure, many creators adapt
strategically, as evidenced by examples like the sprawling search engine
optimization industry. This begets competition for the finite user attention
pool. We formalize these dynamics in what we call an exposure game, a model
of incentives induced by algorithms including modern factorization and (deep)
two-tower architectures. We prove that seemingly innocuous algorithmic
choices—e.g., non-negative vs. unconstrained factorization—significantly
affect the existence and character of (Nash) equilibria in exposure games. We
proffer use of creator behavior models like ours for an (ex-ante)
pre-deployment audit. Such an audit can identify misalignment between
desirable and incentivized content, and thus complement post-hoc measures
like content filtering and moderation. To this end, we propose tools for
numerically finding equilibria in exposure games, and illustrate results of
an audit on the MovieLens and LastFM datasets. Among other things, we find that the
strategically produced content exhibits strong dependence between algorithmic
exploration and content diversity, and between model expressivity and bias
towards gender-based user and creator groups.

Jiri Hron, Roman Novak, Jeffrey Pennington, and Jascha Sohl-Dickstein.
**Bayesian neural networks have a
simple weight posterior: theory and accelerated sampling**.
*ICML*, 2022.

** Abstract:** We introduce repriorisation,
a data-dependent reparameterisation which transforms a Bayesian neural
network (BNN) posterior to a distribution whose KL divergence to the BNN
prior vanishes as layer widths grow. The repriorisation map acts directly on
parameters, and its analytic simplicity complements the known neural network
Gaussian process (NNGP) behaviour of wide BNNs in function space. Exploiting
the repriorisation, we develop a Markov chain Monte Carlo (MCMC) posterior
sampling algorithm which mixes faster the wider the BNN. This contrasts with
the typically poor performance of MCMC in high dimensions. We observe up to
50x higher effective sample size relative to no reparametrisation for both
fully-connected and residual networks. Improvements are achieved at all
widths, with the margin between reparametrised and standard BNNs growing with
layer width.

Jiri Hron, Karl Krauth, Michael I. Jordan, and Niki Kilbertus.
**On component interactions in
two-stage recommender systems**.
*NeurIPS*, 2021.

** Abstract:** Thanks to their
scalability, two-stage recommenders are used by many of today's largest
online platforms, including YouTube, LinkedIn, and Pinterest. These systems
produce recommendations in two steps: (i) multiple nominators, tuned for low
prediction latency, preselect a small subset of candidates from the whole
item pool; (ii) a slower but more accurate ranker further narrows down the
nominated items, and serves them to the user. Despite their popularity, the
literature on two-stage recommenders is relatively scarce, and the algorithms
are often treated as mere sums of their parts. Such treatment presupposes
that the two-stage performance is explained by the behavior of the individual
components in isolation. This is not the case: using synthetic and real-world
data, we demonstrate that interactions between the ranker and the nominators
substantially affect the overall performance. Motivated by these findings, we
derive a generalization lower bound which shows that independent nominator
training can lead to performance on par with uniformly random
recommendations. We find that careful design of item pools, each assigned to
a different nominator, alleviates these issues. As manual search for a good
pool allocation is difficult, we propose to learn one instead using a
Mixture-of-Experts based approach. This significantly improves both precision
and recall at K.

Jiri Hron, Yasaman Bahri, Roman Novak, Jeffrey Pennington, and Jascha
Sohl-Dickstein.
**Exact posteriors of wide
Bayesian neural networks**.
*UDL (ICML workshop)*, 2020.

** Abstract:** Recent work
has shown that the prior over functions induced by a deep Bayesian neural
network (BNN) behaves as a Gaussian process (GP) as the width of all layers
becomes large. However, many BNN applications are concerned with the BNN
function space posterior. While some empirical evidence of the posterior
convergence was provided in the original works of Neal (1996) and Matthews et
al. (2018), it is limited to small datasets or architectures due to the
notorious difficulty of obtaining and verifying exactness of BNN posterior
approximations. We provide the missing theoretical proof that the exact BNN
posterior converges (weakly) to the one induced by the GP limit of the prior.
For empirical validation, we show how to generate exact samples from a finite
BNN on a small dataset via rejection sampling.
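
Rejection sampling from the prior, as mentioned at the end of this abstract, can be sketched on a toy problem. The example below assumes a scalar Gaussian model rather than a BNN, so that the exact posterior mean is known; the function name and settings are hypothetical.

```python
import math
import random


def rejection_sample_posterior(y, noise_sd, num_proposals, seed=0):
    """Exact posterior samples for theta ~ N(0, 1), y | theta ~ N(theta, noise_sd^2)."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(num_proposals):
        theta = rng.gauss(0.0, 1.0)  # propose from the prior
        # Likelihood divided by its maximum over theta, so the ratio is in (0, 1].
        accept_prob = math.exp(-0.5 * ((y - theta) / noise_sd) ** 2)
        if rng.random() < accept_prob:
            accepted.append(theta)
    return accepted


samples = rejection_sample_posterior(y=1.0, noise_sd=1.0, num_proposals=20000)
# For this conjugate model the analytic posterior mean is y / (1 + noise_sd^2) = 0.5.
posterior_mean = sum(samples) / len(samples)
```

The acceptance rate falls quickly as the dataset grows, which is consistent with the abstract restricting this check to a small dataset.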

Jiri Hron, Yasaman Bahri, Jascha Sohl-Dickstein, and Roman Novak.
**Infinite attention: NNGP and
NTK for deep attention networks**.
*ICML*, 2020.

** Abstract:** There is a growing amount of
literature on the relationship between wide neural networks (NNs) and
Gaussian processes (GPs), identifying an equivalence between the two for a
variety of NN architectures. This equivalence enables, for instance, accurate
approximation of the behaviour of wide Bayesian NNs without MCMC or
variational approximations, or characterisation of the distribution of
randomly initialised wide NNs optimised by gradient descent without ever
running an optimiser. We provide a rigorous extension of these results to NNs
involving attention layers, showing that unlike single-head attention, which
induces non-Gaussian behaviour, multi-head attention architectures behave as
GPs as the number of heads tends to infinity. We further discuss the effects
of positional encodings and layer normalisation, and propose modifications of
the attention mechanism which lead to improved results for both finite and
infinitely wide NNs. We evaluate attention kernels empirically, leading to a
moderate improvement upon the previous state-of-the-art on CIFAR-10 for GPs
without trainable kernels and advanced data preprocessing. Finally, we
introduce new features to the Neural Tangents library (Novak et al., 2020)
allowing applications of NNGP/NTK models, with and without attention, to
variable-length sequences, with an example on the IMDb reviews dataset.

Jiri Hron, Karl Krauth, Michael I. Jordan, and Niki Kilbertus.
**Exploration in two-stage
recommender systems**.
*REVEAL (ACM RecSys workshop)*, 2020.

** Abstract:**
Two-stage recommender systems are widely adopted in industry due to their
scalability and maintainability. These systems produce recommendations in two
steps: (i) multiple nominators preselect a small number of items from a large
pool using cheap-to-compute item embeddings; (ii) with a richer set of
features, a ranker rearranges the nominated items and serves them to the
user. A key challenge of this setup is that optimal performance of each stage
in isolation does not imply optimal global performance. In response to this
issue, Ma et al. (2020) proposed a nominator training objective importance
weighted by the ranker's probability of recommending each item. In this work,
we focus on the complementary issue of exploration. Modeled as a contextual
bandit problem, we find LinUCB (a near optimal exploration strategy for
single-stage systems) may lead to linear regret when deployed in two-stage
recommenders. We therefore propose a method of synchronising the exploration
strategies between the ranker and the nominators. Our algorithm only relies
on quantities already computed by standard LinUCB at each stage and can be
implemented in three lines of additional code. We end by demonstrating the
effectiveness of our algorithm experimentally.

Roman Novak, Lechao Xiao, Jiri Hron, Jaehoon Lee, Jascha Sohl-Dickstein, and
Samuel Schoenholz.
**Neural Tangents: fast and easy
infinite networks in Python**.
*ICLR*, 2020.

** Abstract:** Neural Tangents is a library
designed to enable research into infinite-width neural networks. It provides
a high-level API for specifying complex and hierarchical neural network
architectures. These networks can then be trained and evaluated either at
finite-width as usual or in their infinite-width limit. Infinite-width
networks can be trained analytically using exact Bayesian inference or using
gradient descent via the Neural Tangent Kernel. Additionally, Neural Tangents
provides tools to study gradient descent training dynamics of wide but finite
networks in either function space or weight space. The entire library runs
out-of-the-box on CPU, GPU, or TPU. All computations can be automatically
distributed over multiple accelerators with near-linear scaling in the number
of devices.

Mark Rowland, Jiri Hron, Yunhao Tang, Krzysztof Choromanski, Tamas Sarlos, and
Adrian Weller.
**Orthogonal
estimation of Wasserstein distances**.
In *22nd International Conference on Artificial Intelligence and
Statistics*, Okinawa, Japan, April 2019.

** Abstract:**
Wasserstein distances are increasingly used in a wide variety of applications
in machine learning. Sliced Wasserstein distances form an important subclass
which may be estimated efficiently through one-dimensional sorting
operations. In this paper, we propose a new variant of sliced Wasserstein
distance, study the use of orthogonal coupling in Monte Carlo estimation of
Wasserstein distances and draw connections with stratified sampling, and
evaluate our approaches experimentally in a range of large-scale experiments
in generative modelling and reinforcement learning.
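
The sorting-based estimation of sliced Wasserstein distances referred to above can be sketched as follows. This is a minimal version using independent random projections of equal-size 2-D samples; it does not implement the orthogonal coupling or the new variant proposed in the paper, and all names are hypothetical.

```python
import math
import random


def wasserstein_1d_squared(xs, ys):
    """Squared 2-Wasserstein distance between equal-size 1-D samples, via sorting."""
    xs, ys = sorted(xs), sorted(ys)
    return sum((a - b) ** 2 for a, b in zip(xs, ys)) / len(xs)


def sliced_wasserstein(points_a, points_b, num_projections=200, seed=0):
    """Monte Carlo estimate of the sliced 2-Wasserstein distance in 2-D."""
    if len(points_a) != len(points_b):
        raise ValueError("this sketch assumes equal-size samples")
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_projections):
        angle = rng.uniform(0.0, 2.0 * math.pi)  # uniform direction on the circle
        dx, dy = math.cos(angle), math.sin(angle)
        proj_a = [x * dx + y * dy for x, y in points_a]
        proj_b = [x * dx + y * dy for x, y in points_b]
        total += wasserstein_1d_squared(proj_a, proj_b)
    return math.sqrt(total / num_projections)


data_rng = random.Random(1)
cloud = [(data_rng.gauss(0.0, 1.0), data_rng.gauss(0.0, 1.0)) for _ in range(100)]
shifted = [(x + 3.0, y) for x, y in cloud]
distance_same = sliced_wasserstein(cloud, cloud)
distance_shifted = sliced_wasserstein(cloud, shifted)
```

Each projection costs one sort per sample set, which is what makes the sliced distance cheap compared with solving a full optimal transport problem.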

David Janz, Jiri Hron, Przemyslaw Mazur, José Miguel Hernández-Lobato,
Katja Hofmann, and Sebastian Tschiatschek.
**Successor Uncertainties:
exploration and uncertainty in temporal difference learning**.
*NeurIPS*, 2019.

** Abstract:** Posterior sampling for
reinforcement learning (PSRL) is an effective method for balancing
exploration and exploitation in reinforcement learning. Randomised value
functions (RVF) can be viewed as a promising approach to scaling PSRL.
However, we show that most contemporary algorithms combining RVF with neural
network function approximation do not possess the properties which make PSRL
effective, and provably fail in sparse reward problems. Moreover, we find
that propagation of uncertainty, a property of PSRL previously thought
important for exploration, does not preclude this failure. We use these
insights to design Successor Uncertainties (SU), a cheap and easy to
implement RVF algorithm that retains key properties of PSRL. SU is highly
effective on hard tabular exploration benchmarks. Furthermore, on the Atari
2600 domain, it surpasses human performance on 38 of 49 games tested
(achieving a median human normalised score of 2.09), and outperforms its
closest RVF competitor, Bootstrapped DQN, on 36 of those.

Roman Novak, Lechao Xiao, Yasaman Bahri, Jaehoon Lee, Greg Yang, Jiri Hron,
Daniel A. Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein.
**Bayesian deep CNNs with many
channels are Gaussian processes**.
*ICLR*, 2019.

** Abstract:** There is a previously
identified equivalence between wide fully connected neural networks (FCNs)
and Gaussian processes (GPs). This equivalence enables, for instance, test
set predictions that would have resulted from a fully Bayesian, infinitely
wide trained FCN to be computed without ever instantiating the FCN, but by
instead evaluating the corresponding GP. In this work, we derive an analogous
equivalence for multi-layer convolutional neural networks (CNNs) both with
and without pooling layers, and achieve state of the art results on CIFAR10
for GPs without trainable kernels. We also introduce a Monte Carlo method to
estimate the GP corresponding to a given neural network architecture, even in
cases where the analytic form has too many terms to be computationally
feasible. Surprisingly, in the absence of pooling layers, the GPs
corresponding to CNNs with and without weight sharing are identical. As a
consequence, translation equivariance, beneficial in finite channel CNNs
trained with stochastic gradient descent (SGD), is guaranteed to play no role
in the Bayesian treatment of the infinite channel limit—a qualitative
difference between the two regimes that is not present in the FCN case. We
confirm experimentally that, while in some scenarios the performance of
SGD-trained finite CNNs approaches that of the corresponding GPs as the
channel count increases, with careful tuning SGD-trained CNNs can
significantly outperform their corresponding GPs, suggesting advantages from
SGD training compared to fully Bayesian parameter estimation.

Jiri Hron, Alexander G. D. G. Matthews, and Zoubin Ghahramani.
**Variational Bayesian dropout:
pitfalls and fixes**.
*ICML*, 2018.

** Abstract:** Dropout, a stochastic
regularisation technique for training of neural networks, has recently been
reinterpreted as a specific type of approximate inference algorithm for
Bayesian neural networks. The main contribution of the reinterpretation is in
providing a theoretical framework useful for analysing and extending the
algorithm. We show that the proposed framework suffers from several issues:
from undefined or pathological behaviour of the true posterior related to use
of improper priors, to an ill-defined variational objective due to
singularity of the approximating distribution relative to the true posterior.
Our analysis of the improper log uniform prior used in variational Gaussian
dropout suggests the pathologies are generally irredeemable, and that the
algorithm still works only because the variational formulation annuls some of
the pathologies. To address the singularity issue, we proffer Quasi-KL (QKL)
divergence, a new approximate inference objective for approximation of
high-dimensional distributions. We show that motivations for variational
Bernoulli dropout based on discretisation and noise have QKL as a limit.
Properties of QKL are studied both theoretically and on a simple practical
example which shows that the QKL-optimal approximation of a full rank
Gaussian with a degenerate one naturally leads to the Principal Component
Analysis solution.

Alexander G. D. G. Matthews, Jiri Hron, Mark Rowland, Richard E. Turner, and
Zoubin Ghahramani.
**Gaussian process behaviour in
wide deep neural networks**.
*ICLR*, 2018.

** Abstract:** Whilst deep neural networks
have shown great empirical success, there is still much work to be done to
understand their theoretical properties. In this paper, we study the
relationship between random, wide, fully connected, feedforward networks with
more than one hidden layer and Gaussian processes with a recursive kernel
definition. We show that, under broad conditions, as we make the architecture
increasingly wide, the implied random function converges in distribution to a
Gaussian process, formalising and extending existing results by Neal (1996)
to deep networks. To evaluate convergence rates empirically, we use maximum
mean discrepancy. We then compare finite Bayesian deep networks from the
literature to Gaussian processes in terms of the key predictive quantities of
interest, finding that in some cases the agreement can be very close. We
discuss the desirability of Gaussian process behaviour and review
non-Gaussian alternative models from the literature.
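
To make the idea of a recursive kernel definition concrete, here is a minimal sketch of the layer-by-layer recursion for one specific case. It assumes ReLU activations, for which the Gaussian expectation has a well-known closed form (the arc-cosine kernel), and one common scaling convention for weight and bias variances; the paper itself covers more general nonlinearities, and the function name is hypothetical.

```python
import math


def relu_nngp_kernel(x1, x2, depth, weight_var=2.0, bias_var=0.0):
    """Return (K(x1,x1), K(x1,x2), K(x2,x2)) after `depth` ReLU hidden layers.

    Assumes non-zero inputs of equal dimension.
    """
    d = len(x1)

    def dot(a, b):
        return sum(u * v for u, v in zip(a, b))

    # Kernel of the first affine layer.
    k11 = bias_var + weight_var * dot(x1, x1) / d
    k12 = bias_var + weight_var * dot(x1, x2) / d
    k22 = bias_var + weight_var * dot(x2, x2) / d

    for _ in range(depth):
        norm = math.sqrt(k11 * k22)
        cos_theta = max(-1.0, min(1.0, k12 / norm))
        theta = math.acos(cos_theta)
        # E[relu(u) relu(v)] for (u, v) ~ N(0, [[k11, k12], [k12, k22]]).
        expect12 = norm * (math.sin(theta) + (math.pi - theta) * cos_theta) / (2.0 * math.pi)
        k12 = bias_var + weight_var * expect12
        # On the diagonal, E[relu(u)^2] = k / 2.
        k11 = bias_var + weight_var * k11 / 2.0
        k22 = bias_var + weight_var * k22 / 2.0
    return k11, k12, k22


k11, k12, k22 = relu_nngp_kernel([1.0, 0.0], [0.0, 1.0], depth=1)
```

With `weight_var=2.0` and `bias_var=0.0` the diagonal entries are preserved from layer to layer, so only the correlation between the two inputs changes with depth.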

Yarin Gal, Jiri Hron, and Alex Kendall.
**Concrete dropout**.
*NeurIPS*, 2017.

** Abstract:** Dropout is used as a
practical tool to obtain uncertainty estimates in large vision models and
reinforcement learning (RL) tasks. But to obtain well-calibrated uncertainty
estimates, a grid-search over the dropout probabilities is necessary—a
prohibitive operation with large models, and an impossible one with RL. We
propose a new dropout variant which gives improved performance and better
calibrated uncertainties. Relying on recent developments in Bayesian deep
learning, we use a continuous relaxation of dropout's discrete masks.
Together with a principled optimisation objective, this allows for automatic
tuning of the dropout probability in large models, and as a result faster
experimentation cycles. In RL this allows the agent to adapt its uncertainty
dynamically as more data is observed. We analyse the proposed variant
extensively on a range of tasks, and give insights into common practice in
the field where larger dropout probabilities are often used in deeper model
layers.
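
The continuous relaxation of dropout's discrete masks can be sketched with the standard Concrete (relaxed Bernoulli) construction. The snippet below is an illustration under assumed settings (drop probability 0.2, temperature 0.1), with hypothetical function names; it omits the optimisation objective that the paper uses to tune the dropout probability.

```python
import math
import random


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def concrete_drop_mask(p: float, temperature: float, rng: random.Random,
                       eps: float = 1e-7) -> float:
    """Relaxed Bernoulli(p) sample in [0, 1]; values near 1 mean 'drop this unit'."""
    u = min(max(rng.random(), eps), 1.0 - eps)  # keep the logs finite
    logit = math.log(p) - math.log(1.0 - p) + math.log(u) - math.log(1.0 - u)
    return sigmoid(logit / temperature)


rng = random.Random(0)
drops = [concrete_drop_mask(p=0.2, temperature=0.1, rng=rng) for _ in range(20000)]
mean_drop = sum(drops) / len(drops)
# A unit's activation would then be multiplied by (1 - drop) instead of a hard 0/1 mask.
```

Because the mask is a smooth function of `p`, gradients with respect to the dropout probability are available, which is what allows it to be tuned automatically rather than by grid search.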

Alexander G. D. G. Matthews, Jiri Hron, Richard E. Turner, and Zoubin
Ghahramani.
**Sample-then-optimise
posterior sampling for Bayesian linear models**.
*AABI (NeurIPS workshop)*, 2017.

** Abstract:** In modern
machine learning it is common to train models which have an extremely high
intrinsic capacity. The results obtained are often initialization dependent,
are different for disparate optimizers and in some cases have no explicit
regularization. This raises difficult questions about generalization. A
natural approach to questions of generalization is a Bayesian one. There is
therefore a growing literature attempting to understand how Bayesian
posterior inference could emerge from the complexity of modern practice, even
without having such a procedure as the stated goal. In this work we consider
a simple special case where exact Bayesian posterior sampling emerges from
sampling (cf. initialization) and then gradient descent. Specifically, for a
Bayesian linear model, if we parameterize it in terms of a deterministic
function of an isotropic normal prior, then the action of sampling from the
prior followed by first order optimization of the squared loss will give a
posterior sample. Although the assumptions are stronger than many real
problems, it still exhibits the challenging properties of redundant model
capacity and a lack of explicit regularizers, along with initialization and
optimizer dependence. It is therefore an interesting controlled test case.
Given its simplicity, the method itself may turn out to be of independent
interest from our original goal.
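
One way to see the idea is on a one-weight linear model. The sketch below assumes a unit Gaussian prior on the weight and folds the observation noise into the parameters as extra unit Gaussian variables, so that every parameter has an isotropic normal prior. This is an illustrative construction, not necessarily the exact parameterization used in the paper, and all names are hypothetical. Gradient descent on the squared loss, started from a prior draw, converges to the interpolating solution closest to that draw, which coincides with the standard closed-form "prior sample plus data correction" posterior sample.

```python
import random


def sample_then_optimise(phi, y, sigma, rng, steps=500):
    """Prior draw followed by gradient descent on the squared loss.

    Model (assumed for this sketch): y_i = phi_i * w + sigma * eps_i,
    with w ~ N(0, 1) and eps_i ~ N(0, 1) all treated as parameters.
    """
    n = len(phi)
    w = rng.gauss(0.0, 1.0)
    eps = [rng.gauss(0.0, 1.0) for _ in range(n)]
    w0, eps0 = w, list(eps)
    # Step size 1 / (largest curvature of the loss) guarantees convergence.
    lr = 1.0 / (sum(p * p for p in phi) + sigma ** 2)
    for _ in range(steps):
        resid = [phi[i] * w + sigma * eps[i] - y[i] for i in range(n)]
        w -= lr * sum(phi[i] * resid[i] for i in range(n))
        eps = [eps[i] - lr * sigma * resid[i] for i in range(n)]
    return w, w0, eps0


def closed_form_sample(phi, y, sigma, w0, eps0):
    """Posterior sample obtained by correcting the same prior draw analytically."""
    n = len(phi)
    r = [y[i] - phi[i] * w0 - sigma * eps0[i] for i in range(n)]
    return w0 + sum(phi[i] * r[i] for i in range(n)) / (
        sum(p * p for p in phi) + sigma ** 2
    )


phi = [0.5, -1.0, 1.5]
y = [0.4, -1.1, 1.6]
w_gd, w0, eps0 = sample_then_optimise(phi, y, 0.5, random.Random(1))
w_exact = closed_form_sample(phi, y, 0.5, w0, eps0)
```

Here `w_gd` and `w_exact` agree to numerical precision, and no explicit regularizer appears in the loss: the prior enters only through the initialization.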

Neil Houlsby, Ferenc Huszár, Mohammad M Ghassemi, Gergő Orbán,
Daniel M Wolpert, and Máté Lengyel.
**Cognitive
tomography reveals complex task-independent mental representations**.
*Current Biology*, 23(21):2169-2175, 2013, doi
10.1016/j.cub.2013.09.012.

** Abstract:** Humans develop
rich mental representations that guide their behavior in a variety of
every-day tasks. However, it is unknown whether these representations, often
formalized as priors in Bayesian inference, are specific for each task or
subserve multiple tasks. Current approaches cannot distinguish between these
two possibilities because they cannot extract comparable representations
across different tasks. Here, we develop a novel method, termed cognitive
tomography, that can extract complex, multi-dimensional priors across tasks.
We apply this method to human judgments in two qualitatively different tasks,
familiarity and odd-one-out, involving an ecologically relevant set of
stimuli, human faces. We show that priors over faces are structurally complex
and vary dramatically across subjects, but are invariant across the tasks
within each subject. The priors we extract from each task allow us to predict
with high precision the behavior of subjects for novel stimuli both in the
same task as well as in the other task. Our results provide the first
evidence for a single high-dimensional structured representation of a
naturalistic stimulus set that guides behavior in multiple tasks. Moreover,
the representations estimated by cognitive tomography can provide
independent, behavior-based regressors for elucidating the neural correlates
of complex naturalistic priors.

Konstantin Kravtsov, Stanislav Straupe, Igor Radchenko, Neil Houlsby, Ferenc
Huszár, and Sergey Kulik.
**Experimental
adaptive Bayesian tomography**.
*Physical Review A*, 87(6):062122, 2013.

** Abstract:**
We report an experimental realization of an adaptive quantum state tomography
protocol. Our method takes advantage of a Bayesian approach to statistical
inference and is naturally tailored for adaptive strategies. For pure states
we observe close to $N^{-1}$ scaling of infidelity with overall number of
registered events, while best non-adaptive protocols allow for $N^{-1/2}$
scaling only. Experiments are performed for polarization qubits, but the
approach is readily adapted to any dimension.

Ferenc Huszár and David Duvenaud.
**Optimally-weighted herding is
Bayesian quadrature**.
In *28th Conference on Uncertainty in Artificial Intelligence*, pages
377-385, Catalina Island, California, July 2012.

** Abstract:** Herding and kernel herding are deterministic methods of
choosing samples which summarise a probability distribution. A related task
is choosing samples for estimating integrals using Bayesian quadrature. We
show that the criterion minimised when selecting samples in kernel herding is
equivalent to the posterior variance in Bayesian quadrature. We then show
that sequential Bayesian quadrature can be viewed as a weighted version of
kernel herding which achieves performance superior to any other weighted
herding method. We demonstrate empirically a rate of convergence faster than
O(1/N). Our results also imply an upper bound on the empirical error of the
Bayesian quadrature estimate.

Ferenc Huszár and Neil Houlsby.
**Adaptive Bayesian
quantum tomography**.
*Physical Review A*, 85(5):052120, 2012.

** Abstract:**
In this paper we revisit the problem of optimal design of quantum tomographic
experiments. In contrast to previous approaches where an optimal set of
measurements is decided in advance of the experiment, we allow for
measurements to be adaptively and efficiently re-optimised depending on data
collected so far. We develop an adaptive statistical framework based on
Bayesian inference and Shannon's information, and demonstrate a ten-fold
reduction in the total number of measurements required as compared to
non-adaptive methods, including mutually unbiased bases.

**Collaborative
Gaussian processes for preference learning**.
In *Advances in Neural Information Processing Systems 26*, pages
2096-2104. Curran Associates, Inc., 2012.

** Abstract:** We
present a new model based on Gaussian processes (GPs) for learning pairwise
preferences expressed by multiple users. Inference is simplified by using a
*preference kernel* for GPs which allows us to combine supervised GP
learning of user preferences with unsupervised dimensionality reduction for
multi-user systems. The model not only exploits collaborative information
from the shared structure in user behavior, but may also incorporate user
features if they are available. Approximate inference is implemented using a
combination of expectation propagation and variational Bayes. Finally, we
present an efficient active learning strategy for querying preferences. The
proposed technique performs favorably on real-world data against
state-of-the-art multi-user preference learning algorithms.

Simon Lacoste-Julien, Ferenc Huszár, and Zoubin Ghahramani.
**Approximate
inference for the loss-calibrated Bayesian**.
In Geoff Gordon and David Dunson, editors, *14th International Conference on
Artificial Intelligence and Statistics*, volume 15, pages 416-424,
Fort Lauderdale, FL, USA, April 2011. Journal of Machine Learning Research.

** Abstract:** We consider the problem of approximate inference
in the context of Bayesian decision theory. Traditional approaches focus on
approximating general properties of the posterior, ignoring the decision task
- and associated losses - for which the posterior could be used. We argue
that this can be suboptimal and propose instead to *loss-calibrate* the
approximate inference methods with respect to the decision task at hand. We
present a general framework rooted in Bayesian decision theory to analyze
approximate inference from the perspective of losses, opening up several
research directions. As a first loss-calibrated approximate inference
attempt, we propose an EM-like algorithm on the Bayesian posterior risk and
show how it can improve a standard approach to Gaussian process
classification when losses are asymmetric.

Neil Houlsby, Ferenc Huszár, Zoubin Ghahramani, and Máté Lengyel.
**Bayesian active
learning for classification and preference learning**.
*arXiv*, abs/1112.5745, 2011.

** Abstract:** Information
theoretic active learning has been widely studied for probabilistic models.
For simple regression an optimal myopic policy is easily tractable. However,
for other tasks and with more complex models, such as classification with
nonparametric models, the optimal solution is harder to compute. Current
approaches make approximations to achieve tractability. We propose an
approach that expresses information gain in terms of predictive entropies,
and apply this method to the Gaussian Process Classifier (GPC). Our approach
makes minimal approximations to the full information theoretic objective. Our
experimental performance compares favourably to many popular active learning
algorithms, and has equal or lower computational complexity. We compare well
to decision theoretic approaches also, which are privy to more information
and require much more computational time. Secondly, by developing further a
reformulation of binary preference learning to a classification problem, we
extend our algorithm to Gaussian Process preference learning.
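
The "information gain in terms of predictive entropies" can be written, for a binary label, as the entropy of the averaged prediction minus the average entropy of the individual predictions. The sketch below estimates it from a list of class-1 probabilities, one per posterior sample. Using Monte Carlo samples is an assumption made here for brevity (the paper works with a Gaussian process classifier), and the function names are hypothetical.

```python
import math


def binary_entropy(p):
    """Entropy in nats of a Bernoulli(p) variable; 0 at the endpoints."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


def information_gain(prob_samples):
    """Entropy of the mean prediction minus the mean entropy of predictions.

    prob_samples: class-1 probabilities at one candidate input, each
    computed under a different sample of the model parameters.
    """
    mean_p = sum(prob_samples) / len(prob_samples)
    mean_h = sum(binary_entropy(p) for p in prob_samples) / len(prob_samples)
    return binary_entropy(mean_p) - mean_h


# Samples that agree give no gain, even when each one is individually unsure.
agree = information_gain([0.5, 0.5, 0.5])
# Confident samples that disagree give the maximal gain of log(2) nats.
disagree = information_gain([0.0, 1.0])
```

An active learner built on this score would query the candidate input with the largest value, favouring points where the posterior samples disagree rather than points that are merely noisy.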

Ferenc Huszár and Simon Lacoste-Julien.
**A kernel approach to
tractable Bayesian nonparametrics**.
Technical report, University of Cambridge, 2011.

** Abstract:**
Inference in popular nonparametric Bayesian models typically relies on
sampling or other approximations. This paper presents a general methodology
for constructing novel tractable nonparametric Bayesian methods by applying
the kernel trick to inference in a parametric Bayesian model. For example,
Gaussian process regression can be derived this way from Bayesian linear
regression. Despite the success of the Gaussian process framework, the kernel
trick is rarely explicitly considered in the Bayesian literature. In this
paper, we aim to fill this gap and demonstrate the potential of applying the
kernel trick to tractable Bayesian parametric models in a wider context than
just regression. As an example, we present an intuitive Bayesian kernel
machine for density estimation that is obtained by applying the kernel trick
to a Gaussian generative model in feature space.

** Comment:** arXiv:1103.1761

Ferenc Huszár, Uta Noppeney, and Máté Lengyel.
**Mind reading by
machine learning: A doubly Bayesian method for inferring mental
representations**.
In S. Ohlsson and R. Catrambone, editors, *The Proceedings of the 32nd
Annual Meeting of the Cognitive Science Society*, Austin, TX, USA, August
2010. The Cognitive Science Society.

** Abstract:** A central
challenge in cognitive science is to measure and quantify the mental
representations humans develop - in other words, to 'read' subjects' minds.
In order to eliminate potential biases in reporting mental contents due to
verbal elaboration, subjects' responses in experiments are often limited to
binary decisions or discrete choices that do not require conscious reflection
upon their mental contents. However, it is unclear what such impoverished
data can tell us about the potential richness and dynamics of subjects'
mental representations. To address this problem, we used ideal observer
models that formalise choice behaviour as (quasi-)Bayes-optimal, given
subjects' representations in long-term memory, acquired through prior
learning, and the stimuli currently available to them. Bayesian inversion of
such ideal observer models allowed us to infer subjects' mental
representation from their choice behaviour in a variety of psychophysical
tasks. The inferred mental representations also allowed us to predict future
choices of subjects with reasonable accuracy, even in tasks that were
different from those in which the representations were estimated. These
results demonstrate a significant potential in standard binary decision tasks
to recover detailed information about subjects' mental representations.

** Comment:** Supplementary material available here.

Arunkumar Byravan, Leonard Hasenclever, Piotr Trochim, Mehdi Mirza,
Alessandro Davide Ialongo, Yuval Tassa, Jost Tobias Springenberg, Abbas
Abdolmaleki, Nicolas Heess, Josh Merel, and Martin Riedmiller.
**Evaluating model-based
planning and planner amortization for continuous control**.
In *10th International Conference on Learning Representations*, 2022.

** Abstract:** There is a widespread intuition that model-based
control methods should be able to surpass the data efficiency of model-free
approaches. In this paper we attempt to evaluate this intuition on various
challenging locomotion tasks. We take a hybrid approach, combining model
predictive control (MPC) with a learned model and model-free policy learning;
the learned policy serves as a proposal for MPC. We show that MPC with
learned proposals and models (trained on the fly or transferred from related
tasks) can significantly improve performance and data efficiency with respect
to model-free methods. However, we find that well-tuned model-free agents are
strong baselines even for high DoF control problems. Finally, we show that it
is possible to distil a model-based planner into a policy that amortizes the
planning computation without any loss of performance.

Alessandro Davide Ialongo.
**Variational
Inference in Dynamical Systems**.
PhD thesis, University of Cambridge, Department of Engineering, Cambridge, UK,
2022.

** Abstract:** Dynamical systems are a powerful
formalism to analyse the world around us. Many datasets are sequential in
nature, and can be described by a discrete time evolution law. We are
interested in approaching the analysis of such datasets from a probabilistic
perspective. We would like to maintain justified beliefs about quantities
which, though useful in explaining the behaviour of a system, may not be
observable, as well as about the system's evolution itself, especially in
regimes we have not yet observed in our data. The framework of statistical
inference gives us the tools to do so, yet, for many systems of interest,
performing inference exactly is not computationally or analytically
tractable. The contribution of this thesis, then, is twofold: first, we
uncover two sources of bias in existing variational inference methods applied
to dynamical systems in general, and state space models whose transition
function is drawn from a Gaussian process (GPSSM) in particular. We show bias
can derive from assuming posteriors in non-linear systems to be jointly
Gaussian, and from assuming that we can sever the dependence between latent
states and transition function in state space model posteriors. Second, we
propose methods to address these issues, undoing the resulting biases. We do
this without compromising on computational efficiency or on the ability to
scale to larger datasets and higher dimensions, compared to the methods we
rectify. One method, the Markov Autoregressive Flow (Markov AF) addresses the
Gaussian assumption, by providing a more flexible class of posteriors, based
on normalizing flows, which can be easily evaluated, sampled, and optimised.
The other method, Variationally Coupled Dynamics and Trajectories (VCDT),
tackles the factorisation assumption, leveraging sparse Gaussian processes
and their variational representation to reintroduce dependence between latent
states and the transition function at no extra computational cost. Since the
objective of inference is to maintain calibrated beliefs, if we employed
approximations which are significantly biased in non-linear, noisy systems,
or when there is little data available, we would have failed in our
objective, as those are precisely the regimes in which uncertainty
quantification is all the more important. Hence we think it is essential, if
we wish to act optimally on such beliefs, to uncover, and, if possible, to
correct, all sources of systematic bias in our inference methods.

Joseph Marino, Alexandre Piche, Alessandro Davide Ialongo, and Yisong Yue.
**Iterative
amortized policy optimization**.
In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan,
editors, *Advances in Neural Information Processing Systems 34*,
volume 34, pages 15667-15681. Curran Associates, Inc., 2021.

** Abstract:** Policy networks are a central feature of deep
reinforcement learning (RL) algorithms for continuous control, enabling the
estimation and sampling of high-value actions. From the variational inference
perspective on RL, policy networks, when used with entropy or KL
regularization, are a form of amortized optimization, optimizing network
parameters rather than the policy distributions directly. However, direct
amortized mappings can yield suboptimal policy estimates and restricted
distributions, limiting performance and exploration. Given this perspective,
we consider the more flexible class of iterative amortized optimizers. We
demonstrate that the resulting technique, iterative amortized policy
optimization, yields performance improvements over direct amortization on
benchmark continuous control tasks.

Alessandro Davide Ialongo, Mark van der Wilk, James Hensman, and Carl Edward
Rasmussen.
**Overcoming mean-field
approximations in recurrent Gaussian process models**.
In *36th International Conference on Machine Learning*, Long Beach, June
2019.

** Abstract:** We identify a new variational inference
scheme for dynamical systems whose transition function is modelled by a
Gaussian process. Inference in this setting has either employed
computationally intensive MCMC methods, or relied on factorisations of the
variational posterior. As we demonstrate in our experiments, the
factorisation between latent system states and transition function can lead
to a miscalibrated posterior and to learning unnecessarily large noise terms.
We eliminate this factorisation by explicitly modelling the dependence
between state trajectories and the Gaussian process posterior. Samples of the
latent states can then be tractably generated by conditioning on this
representation. The method we obtain (VCDT: variationally coupled dynamics
and trajectories) gives better predictive performance and more calibrated
estimates of the transition function, yet maintains the same time and space
complexities as mean-field methods. Code is available at:
https://github.com/ialong/GPt.

Alessandro Davide Ialongo, Mark van der Wilk, James Hensman, and Carl Edward
Rasmussen.
**Non-factorised variational
inference in dynamical systems**.
In *First Symposium on Advances in Approximate Bayesian Inference*,
Montreal, December 2018.

** Abstract:** We focus on
variational inference in dynamical systems where the discrete time transition
function (or evolution rule) is modelled by a Gaussian process. The dominant
approach so far has been to use a factorised posterior distribution,
decoupling the transition function from the system states. This is not exact
in general and can lead to an overconfident posterior over the transition
function as well as an overestimation of the intrinsic stochasticity of the
system (process noise). We propose a new method that addresses these issues
and incurs no additional computational costs.

Carlo Ciliberto, Mark Herbster, Alessandro Davide Ialongo, Massimiliano Pontil,
Andrea Rocchetto, Simone Severini, and Leonard Wossnig.
**Quantum machine learning: a classical perspective**.
In *Proc. R. Soc. A*, volume 474, page 20170551. The Royal Society,
2018, doi
10.1098/rspa.2017.0551.

** Abstract:** Recently, increased
computational power and data availability, as well as algorithmic advances,
have led machine learning techniques to impressive results in regression,
classification, data-generation and reinforcement learning tasks. Despite
these successes, the proximity to the physical limits of chip fabrication
alongside the increasing size of datasets are motivating a growing number of
researchers to explore the possibility of harnessing the power of quantum
computation to speed-up classical machine learning algorithms. Here we review
the literature in quantum machine learning and discuss perspectives for a
mixed readership of classical machine learning and quantum computation
experts. Particular emphasis will be placed on clarifying the limitations of
quantum algorithms, how they compare with their best classical counterparts
and why quantum resources are expected to provide advantages for learning
problems. Learning in the presence of noise and certain computationally hard
problems in machine learning are identified as promising directions for the
field. Practical questions, like how to upload classical data into quantum
form, will also be addressed.

Alessandro Davide Ialongo, Mark van der Wilk, and Carl Edward Rasmussen.
**Closed-form inference and
prediction in Gaussian process state-space models**.
In *NIPS Time Series Workshop 2017*, Long Beach, December 2017.

** Abstract:** We examine an analytic variational inference
scheme for the Gaussian Process State Space Model (GPSSM) - a probabilistic
model for system identification and time-series modelling. Our approach
performs variational inference over both the system states and the transition
function. We exploit Markov structure in the true posterior, as well as an
inducing point approximation to achieve linear time complexity in the length
of the time series. Contrary to previous approaches, no Monte Carlo sampling
is required: inference is cast as a deterministic optimisation problem. In a
number of experiments, we demonstrate the ability to model non-linear
dynamics in the presence of both process and observation noise as well as to
impute missing information (e.g. velocities from raw positions through time),
to de-noise, and to estimate the underlying dimensionality of the system.
Finally, we also introduce a closed-form method for multi-step prediction,
and a novel criterion for assessing the quality of our approximate
posterior.

**Adapting the
linearised Laplace model evidence for modern deep learning**.
In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang
Niu, and Sivan Sabato, editors, *39th International Conference on Machine
Learning*, volume 162 of *Proceedings of Machine Learning
Research*, pages 796-821. PMLR, 2022.

** Abstract:**
The linearised Laplace method for estimating model uncertainty has received
renewed attention in the Bayesian deep learning community. The method
provides reliable error bars and admits a closed-form expression for the
model evidence, allowing for scalable selection of model hyperparameters. In
this work, we examine the assumptions behind this method, particularly in
conjunction with model selection. We show that these interact poorly with
some now-standard tools of deep learning (stochastic approximation methods
and normalisation layers) and make recommendations for how to better adapt
this classic method to the modern setting. We provide theoretical support for
our recommendations and validate them empirically on MLPs, classic CNNs,
residual networks with and without normalisation layers, generative
autoencoders and transformers.

L. Gresele*, J. von Kügelgen*, J. M. Kübler*, E. Kirschbaum,
B. Schölkopf, and D. Janzing.
**Causal
inference through the structural causal marginal problem**.
In *39th International Conference on Machine Learning*, volume 162,
pages 7793-7824. PMLR, 2022.
*equal contribution.

** Abstract:** We introduce an approach to
counterfactual inference based on merging information from multiple datasets.
We consider a causal reformulation of the statistical marginal problem: given
a collection of marginal structural causal models (SCMs) over distinct but
overlapping sets of variables, determine the set of joint SCMs that are
counterfactually consistent with the marginal ones. We formalise this
approach for categorical SCMs using the response function formulation and
show that it reduces the space of allowed marginal and joint SCMs. Our work
thus highlights a new mode of falsifiability through additional variables, in
contrast to the statistical one via additional data.

O. Makansi, J. von Kügelgen, F. Locatello, P. Gehler, D. Janzing, T. Brox,
and B. Schölkopf.
**You mostly walk alone:
Analyzing feature attribution in trajectory prediction**.
In *10th International Conference on Learning Representations*, 2022.

** Abstract:** Predicting the future trajectory of a moving
agent can be easy when the past trajectory continues smoothly but is
challenging when complex interactions with other agents are involved. Recent
deep learning approaches for trajectory prediction show promising performance
and partially attribute this to successful reasoning about agent-agent
interactions. However, it remains unclear which features such black-box
models actually learn to use for making predictions. This paper proposes a
procedure that quantifies the contributions of different cues to model
performance based on a variant of Shapley values. Applying this procedure to
state-of-the-art trajectory prediction methods on standard benchmark datasets
shows that they are, in fact, unable to reason about interactions. Instead,
the past trajectory of the target is the only feature used for predicting its
future. For a task with richer social interaction patterns, on the other
hand, the tested models do pick up such interactions to a certain extent, as
quantified by our feature attribution method. We discuss the limits of the
proposed method and its links to causality.

David Janz, David Burt, and Javier Gonzalez.
**Bandit
optimisation of functions in the Matérn kernel RKHS**.
In *23rd International Conference on Artificial Intelligence and
Statistics*, 2020.

** Abstract:** We consider the problem
of optimising functions in the reproducing kernel Hilbert space (RKHS) of a
Matérn kernel with smoothness parameter $\nu$ over the domain $[0,1]^d$ under
noisy bandit feedback. Our contribution, the $π$-GP-UCB algorithm, is the
first practical approach with guaranteed sublinear regret for all $\nu>1$
and $d \geq 1$. Empirical validation suggests better performance and
drastically improved computational scalability compared with its predecessor,
Improved GP-UCB.

David Janz, Jiri Hron, Przemyslaw Mazur, José Miguel Hernández-Lobato,
Katja Hofmann, and Sebastian Tschiatschek.
**Successor Uncertainties:
exploration and uncertainty in temporal difference learning**.
*NeurIPS*, 2019.

** Abstract:** Posterior sampling for
reinforcement learning (PSRL) is an effective method for balancing
exploration and exploitation in reinforcement learning. Randomised value
functions (RVF) can be viewed as a promising approach to scaling PSRL.
However, we show that most contemporary algorithms combining RVF with neural
network function approximation do not possess the properties which make PSRL
effective, and provably fail in sparse reward problems. Moreover, we find
that propagation of uncertainty, a property of PSRL previously thought
important for exploration, does not preclude this failure. We use these
insights to design Successor Uncertainties (SU), a cheap and easy to
implement RVF algorithm that retains key properties of PSRL. SU is highly
effective on hard tabular exploration benchmarks. Furthermore, on the Atari
2600 domain, it surpasses human performance on 38 of 49 games tested
(achieving a median human normalised score of 2.09), and outperforms its
closest RVF competitor, Bootstrapped DQN, on 36 of those.

Jiri Hron, Karl Krauth, Michael I. Jordan, Niki Kilbertus, and Sarah Dean.
**Modelling content creator
incentives on algorithm-curated platforms**.
*arXiv*, 2022.

** Abstract:** Content creators compete
for user attention. Their reach crucially depends on algorithmic choices made
by developers on online platforms. To maximize exposure, many creators adapt
strategically, as evidenced by examples like the sprawling search engine
optimization industry. This begets competition for the finite user attention
pool. We formalize these dynamics in what we call an exposure game, a model
of incentives induced by algorithms including modern factorization and (deep)
two-tower architectures. We prove that seemingly innocuous algorithmic
choices—e.g., non-negative vs. unconstrained factorization—significantly
affect the existence and character of (Nash) equilibria in exposure games. We
proffer use of creator behavior models like ours for an (ex-ante)
pre-deployment audit. Such an audit can identify misalignment between
desirable and incentivized content, and thus complement post-hoc measures
like content filtering and moderation. To this end, we propose tools for
numerically finding equilibria in exposure games, and illustrate results of
an audit on the MovieLens and LastFM datasets. Among other findings, the
strategically produced content exhibits strong dependence between algorithmic
exploration and content diversity, and between model expressivity and bias
towards gender-based user and creator groups.

Jiri Hron, Karl Krauth, Michael I. Jordan, and Niki Kilbertus.
**On component interactions in
two-stage recommender systems**.
*NeurIPS*, 2021.

** Abstract:** Thanks to their
scalability, two-stage recommenders are used by many of today's largest
online platforms, including YouTube, LinkedIn, and Pinterest. These systems
produce recommendations in two steps: (i) multiple nominators, tuned for low
prediction latency, preselect a small subset of candidates from the whole
item pool; (ii) a slower but more accurate ranker further narrows down the
nominated items, and serves them to the user. Despite their popularity, the
literature on two-stage recommenders is relatively scarce, and the algorithms
are often treated as mere sums of their parts. Such treatment presupposes
that the two-stage performance is explained by the behavior of the individual
components in isolation. This is not the case: using synthetic and real-world
data, we demonstrate that interactions between the ranker and the nominators
substantially affect the overall performance. Motivated by these findings, we
derive a generalization lower bound which shows that independent nominator
training can lead to performance on par with uniformly random
recommendations. We find that careful design of item pools, each assigned to
a different nominator, alleviates these issues. As manual search for a good
pool allocation is difficult, we propose to learn one instead using a
Mixture-of-Experts based approach. This significantly improves both precision
and recall at K.

Jiri Hron, Karl Krauth, Michael I. Jordan, and Niki Kilbertus.
**Exploration in two-stage
recommender systems**.
*REVEAL (ACM RecSys workshop)*, 2020.

** Abstract:**
Two-stage recommender systems are widely adopted in industry due to their
scalability and maintainability. These systems produce recommendations in two
steps: (i) multiple nominators preselect a small number of items from a large
pool using cheap-to-compute item embeddings; (ii) with a richer set of
features, a ranker rearranges the nominated items and serves them to the
user. A key challenge of this setup is that optimal performance of each stage
in isolation does not imply optimal global performance. In response to this
issue, Ma et al. (2020) proposed a nominator training objective importance
weighted by the ranker's probability of recommending each item. In this work,
we focus on the complementary issue of exploration. Modeled as a contextual
bandit problem, we find LinUCB (a near optimal exploration strategy for
single-stage systems) may lead to linear regret when deployed in two-stage
recommenders. We therefore propose a method of synchronising the exploration
strategies between the ranker and the nominators. Our algorithm only relies
on quantities already computed by standard LinUCB at each stage and can be
implemented in three lines of additional code. We end by demonstrating the
effectiveness of our algorithm experimentally.

Niki Kilbertus, Manuel Gomez Rodriguez, Bernhard Schölkopf, Krikamol Muandet,
and Isabel Valera.
**Fair decisions
despite imperfect predictions**.
In Silvia Chiappa and Roberto Calandra, editors, *23rd International
Conference on Artificial Intelligence and Statistics*, volume 108 of
*Proceedings of Machine Learning Research*, pages 277-287. PMLR,
26-28 Aug 2020.

** Abstract:** Consequential decisions are
increasingly informed by sophisticated data-driven predictive models.
However, consistently learning accurate predictive models requires access to
ground truth labels. Unfortunately, in practice, labels may only exist
conditional on certain decisions—if a loan is denied, there is not even an
option for the individual to pay back the loan. In this paper, we show that,
in this selective labels setting, learning to predict is suboptimal in terms
of both fairness and utility. To avoid this undesirable behavior, we propose
to directly learn stochastic decision policies that maximize utility under
fairness constraints. In the context of fair machine learning, our results
suggest the need for a paradigm shift from "learning to predict" to "learning
to decide". Experiments on synthetic and real-world data illustrate the
favorable properties of learning to decide, in terms of both utility and
fairness.

Timothy Gebhard, Niki Kilbertus, Ian Harry, and Bernhard Schölkopf.
**Convolutional
neural networks: A magic bullet for gravitational-wave detection?**
*Physical Review D*, 100(6):063015, September 2019, doi
https://doi.org/10.1103/PhysRevD.100.063015.

** Abstract:** In the last few years, machine learning techniques, in
particular convolutional neural networks, have been investigated as a method
to replace or complement traditional matched filtering techniques that are
used to detect the gravitational-wave signature of merging black holes.
However, to date, these methods have not yet been successfully applied to the
analysis of long stretches of data recorded by the Advanced LIGO and Virgo
gravitational-wave observatories. In this work, we critically examine the use
of convolutional neural networks as a tool to search for merging black holes.
We identify the strengths and limitations of this approach, highlight some
common pitfalls in translating between machine learning and
gravitational-wave astronomy, and discuss the interdisciplinary challenges.
In particular, we explain in detail why convolutional neural networks alone
cannot be used to claim a statistically significant gravitational-wave
detection. However, we demonstrate how they can still be used to rapidly flag
the times of potential signals in the data for a more detailed follow-up. Our
convolutional neural network architecture as well as the proposed performance
metrics are better suited for this task than a standard binary
classification scheme. A detailed evaluation of our approach on Advanced
LIGO data demonstrates the potential of such systems as trigger generators.
Finally, we sound a note of caution by constructing adversarial examples,
which showcase interesting "failure modes" of our model, where inputs with no
visible resemblance to real gravitational-wave signals are identified as such
by the network with high confidence.

Niki Kilbertus, Phil Ball, Matt Kusner, Adrian Weller, and Ricardo Silva.
**The sensitivity of counterfactual
fairness to unmeasured confounding**.
In *35th Conference on Uncertainty in Artificial Intelligence*, Tel
Aviv, July 2019.

** Abstract:** Causal approaches to fairness
have seen substantial recent interest, both from the machine learning
community and from wider parties interested in ethical prediction algorithms.
In no small part, this has been due to the fact that causal models allow one
to simultaneously leverage data and expert knowledge to remove discriminatory
effects from predictions. However, one of the primary assumptions in causal
modeling is that you know the causal graph. This introduces a new opportunity
for bias, caused by misspecifying the causal model. One common way for
misspecification to occur is via unmeasured confounding: the true causal
effect between variables is partially described by unobserved quantities. In
this work we design tools to assess the sensitivity of fairness measures to
this confounding for the popular class of non-linear additive noise models
(ANMs). Specifically, we give a procedure for computing the maximum
difference between two counterfactually fair predictors, where one has become
biased due to confounding. For the case of bivariate confounding our
technique can be swiftly computed via a sequence of closed-form updates. For
multivariate confounding we give an algorithm that can be efficiently solved
via automatic differentiation. We demonstrate our new sensitivity analysis
tools in real-world fairness scenarios to assess the bias arising from
confounding.

Niki Kilbertus, Adria Gascon, Matt Kusner, Michael Veale, Krishna P. Gummadi,
and Adrian Weller.
**Blind
justice: Fairness with encrypted sensitive attributes**.
In *35th International Conference on Machine Learning*, Stockholm,
Sweden, July 2018.

** Abstract:** Recent work has explored how
to train machine learning models which do not discriminate against any
subgroup of the population as determined by sensitive attributes such as
gender or race. To avoid disparate treatment, sensitive attributes should not
be considered. On the other hand, in order to avoid disparate impact,
sensitive attributes must be examined — e.g., in order to learn a fair
model, or to check if a given model is fair. We introduce methods from secure
multi-party computation which allow us to avoid both. By encrypting sensitive
attributes, we show how an outcome based fair model may be learned, checked,
or have its outputs verified and held to account, without users revealing
their sensitive attributes.

Giambattista Parascandolo, Niki Kilbertus, Mateo Rojas-Carulla, and Bernhard
Schölkopf.
**Learning
independent causal mechanisms**.
In *35th International Conference on Machine Learning*, Stockholm,
Sweden, July 2018.

** Abstract:** Statistical learning relies
upon data sampled from a distribution, and we usually do not care what
actually generated it in the first place. From the point of view of causal
modeling, the structure of each distribution is induced by physical
mechanisms that give rise to dependences between observables. Mechanisms,
however, can be meaningful autonomous modules of generative models that make
sense beyond a particular entailed data distribution, lending themselves to
transfer between problems. We develop an algorithm to recover a set of
independent (inverse) mechanisms from a set of transformed data points. The
approach is unsupervised and based on a set of experts that compete for data
generated by the mechanisms, driving specialization. We analyze the proposed
method in a series of experiments on image data. Each expert learns to map a
subset of the transformed data back to a reference distribution. The learned
mechanisms generalize to novel domains. We discuss implications for transfer
learning and links to recent trends in generative modeling.

Niki Kilbertus, Mateo Rojas-Carulla, Giambattista Parascandolo, Moritz Hardt,
Dominik Janzing, and Bernhard Schölkopf.
**Avoiding discrimination
through causal reasoning**.
In *Advances in Neural Information Processing Systems 30*, Long Beach,
California, December 2017.

** Abstract:** Recent work on
fairness in machine learning has focused on various statistical
discrimination criteria and how they trade off. Most of these criteria are
observational: They depend only on the joint distribution of predictor,
protected attribute, features, and outcome. While convenient to work with,
observational criteria have severe inherent limitations that prevent them
from resolving matters of fairness conclusively. Going beyond observational
criteria, we frame the problem of discrimination based on protected
attributes in the language of causal reasoning. This viewpoint shifts
attention from "What is the right fairness criterion?" to "What do we want
to assume about our model of the causal data generating process?" Through
the lens of causality, we make several contributions. First, we crisply
articulate why and when observational criteria fail, thus formalizing what
was before a matter of opinion. Second, our approach exposes previously
ignored subtleties and why they are fundamental to the problem. Finally, we
put forward natural causal non-discrimination criteria and develop algorithms
that satisfy them.

Creighton Heaukulani, David A. Knowles, and Zoubin Ghahramani.
**Beta diffusion trees and
hierarchical feature allocations**.
Technical report, Dept. of Engineering, University of Cambridge, August 2014.

** Abstract:** We define the beta diffusion tree, a random tree
structure with a set of leaves that defines a collection of overlapping
subsets of objects, known as a feature allocation. A generative process for
the tree structure is defined in terms of particles (representing the
objects) diffusing in some continuous space, analogously to the Dirichlet
diffusion tree (Neal, 2003b), which defines a tree structure over partitions
(i.e., non-overlapping subsets) of the objects. Unlike in the Dirichlet
diffusion tree, multiple copies of a particle may exist and diffuse along
multiple branches in the beta diffusion tree, and an object may therefore
belong to multiple subsets of particles. We demonstrate how to build a
hierarchically-clustered factor analysis model with the beta diffusion tree
and how to perform inference over the random tree structures with a Markov
chain Monte Carlo algorithm. We conclude with several numerical experiments
on missing data problems with data sets of gene expression microarrays,
international development statistics, and intranational socioeconomic
measurements.

Creighton Heaukulani, David A. Knowles, and Zoubin Ghahramani.
**Beta
diffusion trees**.
In *31st International Conference on Machine Learning*, Beijing, China,
June 2014.

** Abstract:** We define the beta diffusion tree, a
random tree structure with a set of leaves that defines a collection of
overlapping subsets of objects, known as a feature allocation. The generative
process for the tree is defined in terms of particles (representing the
objects) diffusing in some continuous space, analogously to the Dirichlet and
Pitman–Yor diffusion trees (Neal, 2003b; Knowles & Ghahramani, 2011), both
of which define tree structures over clusters of the particles. With the beta
diffusion tree, however, multiple copies of a particle may exist and diffuse
to multiple locations in the continuous space, resulting in (a random number
of) possibly overlapping clusters of the objects. We demonstrate how to build
a hierarchically-clustered factor analysis model with the beta diffusion tree
and how to perform inference over the random tree structures with a Markov
chain Monte Carlo algorithm. We conclude with several numerical experiments
on missing data problems with data sets of gene expression arrays,
international development statistics, and intranational socioeconomic
measurements.

Konstantina Palla, David A. Knowles, and Zoubin Ghahramani.
**A reversible
infinite HMM using normalised random measures**.
In *31st International Conference on Machine Learning*, Beijing, China,
June 2014.

** Abstract:** We present a nonparametric prior
over reversible Markov chains. We use completely random measures,
specifically gamma processes, to construct a countably infinite graph with
weighted edges. By enforcing symmetry to make the edges undirected we define
a prior over random walks on graphs that results in a reversible Markov
chain. The resulting prior over infinite transition matrices is closely
related to the hierarchical Dirichlet process but enforces reversibility. A
reinforcement scheme has recently been proposed with similar properties, but
the de Finetti measure is not well characterised. We take the alternative
approach of explicitly constructing the mixing measure, which allows more
straightforward and efficient inference at the cost of no longer having a
closed form predictive distribution. We use our process to construct a
reversible infinite HMM which we apply to two real datasets, one from
epigenomics and one ion channel recording.

Novi Quadrianto, Viktoriia Sharmanska, David A. Knowles, and Zoubin Ghahramani.
**The supervised
IBP: Neighbourhood preserving infinite latent feature models**.
In *29th Conference on Uncertainty in Artificial Intelligence*,
Bellevue, USA, July 2013.

** Abstract:** We propose a
probabilistic model to infer supervised latent variables in the Hamming space
from observed data. Our model allows simultaneous inference of the number of
binary latent variables, and their values. The latent variables preserve
neighbourhood structure of the data in a sense that objects in the same
semantic concept have similar latent values, and objects in different
concepts have dissimilar latent values. We formulate the supervised infinite
latent variable problem based on an intuitive principle of pulling objects
together if they are of the same type, and pushing them apart if they are
not. We then combine this principle with a flexible Indian Buffet Process
prior on the latent variables. We show that the inferred supervised latent
variables can be directly used to perform a nearest neighbour search for the
purpose of retrieval. We introduce a new application of dynamically extending
hash codes, and show how to effectively couple the structure of the hash
codes with continuously growing structure of the neighbourhood preserving
infinite latent feature space.

Konstantina Palla, David A. Knowles, and Zoubin Ghahramani.
**A
nonparametric variable clustering model**.
In *Advances in Neural Information Processing Systems 26*, pages 1-9,
Lake Tahoe, California, USA, December 2012.

** Abstract:**
Factor analysis models effectively summarise the covariance structure of high
dimensional data, but the solutions are typically hard to interpret. This
motivates attempting to find a disjoint partition, i.e. a simple clustering,
of observed variables into highly correlated subsets. We introduce a Bayesian
non-parametric approach to this problem, and demonstrate advantages over
heuristic methods proposed to date. Our Dirichlet process variable clustering
(DPVC) model can discover block-diagonal covariance structures in data. We
evaluate our method on both synthetic and gene expression analysis
problems.

Konstantina Palla, David A. Knowles, and Zoubin Ghahramani.
**An infinite latent attribute
model for network data**.
In *29th International Conference on Machine Learning*, Edinburgh,
Scotland, June 2012.

** Abstract:** Latent variable models for
network data extract a summary of the relational structure underlying an
observed network. The simplest possible models subdivide nodes of the network
into clusters; the probability of a link between any two nodes then depends
only on their cluster assignment. Currently available models can be
classified by whether clusters are disjoint or are allowed to overlap. These
models can explain a "flat" clustering structure. Hierarchical Bayesian
models provide a natural approach to capture more complex dependencies. We
propose a model in which objects are characterised by a latent feature
vector. Each feature is itself partitioned into disjoint groups
(subclusters), corresponding to a second layer of hierarchy. In experimental
comparisons, the model achieves significantly improved predictive performance
on social and biological link prediction tasks. The results indicate that
models with a single layer hierarchy over-simplify real networks.

Andrew Gordon Wilson, David A. Knowles, and Zoubin Ghahramani.
**Gaussian process regression
networks**.
In *29th International Conference on Machine Learning*, Edinburgh,
Scotland, June 2012.

** Abstract:** We introduce a new
regression framework, Gaussian process regression networks (GPRN), which
combines the structural properties of Bayesian neural networks with the
nonparametric flexibility of Gaussian processes. GPRN accommodates input
(predictor) dependent signal and noise correlations between multiple output
(response) variables, input dependent length-scales and amplitudes, and
heavy-tailed predictive distributions. We derive both elliptical slice
sampling and variational Bayes inference procedures for GPRN. We apply GPRN
as a multiple output regression and multivariate volatility model,
demonstrating substantially improved performance over eight popular multiple
output (multi-task) Gaussian process models and three multivariate volatility
models on real datasets, including a 1000 dimensional gene expression
dataset.

Andrew Gordon Wilson, David A Knowles, and Zoubin Ghahramani.
**Gaussian process
regression networks**.
Technical Report arXiv:1110.4411 [stat.ML], Department of Engineering,
University of Cambridge, Cambridge, UK, October 19 2011.

** Abstract:** We introduce a new regression framework, Gaussian process
regression networks (GPRN), which combines the structural properties of
Bayesian neural networks with the non-parametric flexibility of Gaussian
processes. This model accommodates input dependent signal and noise
correlations between multiple response variables, input dependent
length-scales and amplitudes, and heavy-tailed predictive distributions. We
derive both efficient Markov chain Monte Carlo and variational Bayes
inference procedures for this model. We apply GPRN as a multiple output
regression and multivariate volatility model, demonstrating substantially
improved performance over eight popular multiple output (multi-task) Gaussian
process models and three multivariate volatility models on benchmark
datasets, including a 1000 dimensional gene expression dataset.

** Comment:** arXiv:1110.4411

David A. Knowles and Zoubin Ghahramani.
**Nonparametric
Bayesian sparse factor models with application to gene expression
modelling**.
*Annals of Applied Statistics*, 5(2B):1534-1552, 2011.

** Abstract:** A nonparametric Bayesian extension of Factor Analysis (FA) is
proposed where observed data Y is modeled as a linear superposition, G, of a
potentially infinite number of hidden factors, X. The Indian Buffet Process
(IBP) is used as a prior on G to incorporate sparsity and to allow the number
of latent features to be inferred. The model's utility for modeling gene
expression data is investigated using randomly generated data sets based on a
known sparse connectivity matrix for E. coli, and on three biological data
sets of increasing complexity.

David A. Knowles and Zoubin Ghahramani.
**Pitman-Yor
diffusion trees**.
In *27th Conference on Uncertainty in Artificial Intelligence*, 2011.

** Abstract:** We introduce the Pitman-Yor Diffusion Tree (PYDT)
for hierarchical clustering, a generalization of the Dirichlet Diffusion Tree
(Neal, 2001) which removes the restriction to binary branching structure. The
generative process is described and shown to result in an exchangeable
distribution over data points. We prove some theoretical properties of the
model and then present two inference methods: a collapsed MCMC sampler which
allows us to model uncertainty over tree structures, and a computationally
efficient greedy Bayesian EM search algorithm. Both algorithms use message
passing on the tree structure. The utility of the model and algorithms is
demonstrated on synthetic and real world data, both continuous and
binary.

** Comment:** web site

David A. Knowles, Jurgen Van Gael, and Zoubin Ghahramani.
**Message passing
algorithms for the Dirichlet diffusion tree**.
In *28th International Conference on Machine Learning*, 2011.

** Abstract:** We demonstrate efficient approximate inference
for the Dirichlet Diffusion Tree (Neal, 2003), a Bayesian nonparametric prior
over tree structures. Although DDTs provide a powerful and elegant approach
for modeling hierarchies they haven't seen much use to date. One problem is
the computational cost of MCMC inference. We provide the first deterministic
approximate inference methods for DDT models and show excellent performance
compared to the MCMC alternative. We present message passing algorithms to
approximate the Bayesian model evidence for a specific tree. This is used to
drive sequential tree building and greedy search to find optimal tree
structures, corresponding to hierarchical clusterings of the data. We
demonstrate appropriate observation models for continuous and binary data.
The empirical performance of our method is very close to the computationally
expensive MCMC alternative on a density estimation problem, and significantly
outperforms kernel density estimators.

** Comment:** web site

David A. Knowles and Thomas P. Minka.
**Non-conjugate
variational message passing for multinomial and binary regression**.
In *Advances in Neural Information Processing Systems 25*, 2011.

** Abstract:** Variational Message Passing (VMP) is an
algorithmic implementation of the Variational Bayes (VB) method which applies
only in the special case of conjugate exponential family models. We propose
an extension to VMP, which we refer to as Non-conjugate Variational Message
Passing (NCVMP) which aims to alleviate this restriction while maintaining
modularity, allowing choice in how expectations are calculated, and
integrating into an existing message-passing framework: Infer.NET. We
demonstrate NCVMP on logistic binary and multinomial regression. In the
multinomial case we introduce a novel variational bound for the softmax
factor which is tighter than other commonly used bounds whilst maintaining
computational tractability.

** Comment:** web site supplementary

David A. Knowles, Leopold Parts, Daniel Glass, and John M. Winn.
**Inferring a
measure of physiological age from multiple ageing related phenotypes**.
In *NIPS Workshop: From Statistical Genetics to Predictive Models in
Personalized Medicine*, 2011.

** Abstract:** What is
ageing? One definition is simultaneous degradation of multiple organ systems.
Can an individual be said to be "old" or "young" for their (chronological)
age in a scientifically meaningful way? We investigate these questions using
ageing related phenotypes measured on the 12,000 female twins in the Twins UK
study. We propose a simple linear model of ageing, which allows a latent
adjustment to be made to an individual's chronological age to give her
"physiological age", shared across the observed phenotypes. We note problems
with the analysis resulting from the linearity assumption and show how to
alleviate these issues using a non-linear extension. We find more gene
expression probes are significantly associated with our measurement of
physiological age than with chronological age.

** Comment:** web site

Mehregan Movassagh, Mun-Kit Choy, David A. Knowles, Lina Cordeddu, Syed Haider,
Thomas Down, Lee Siggens, Ana Vujic, Ilenia Simeoni, Chris Penkett, Martin
Goddard, Pietro Lio, Martin Bennett, and Roger Foo.
**Distinct
epigenomic features in human cardiomyopathy**.
*Circulation, American Heart Association*, 2011.

** Abstract:** Background. The epigenome refers to marks on the genome
including DNA methylation and histone modifications that regulate the
expression of underlying genes. A consistent profile of gene expression
changes in end-stage cardiomyopathy led us to hypothesise that distinct
global patterns of the epigenome may also exist. Methods and Results. We
constructed genome-wide maps of DNA methylation and Histone-3 Lysine-36
tri-methylation (H3K36me3)-enrichment for cardiomyopathic and normal human
hearts. 506Mb of sequence per library was generated by high-throughput
sequencing, covering 24 million out of the 28 million CG di-nucleotides in
the human genome. DNA methylation was significantly different in promoter
CpG-islands (CGI), intra-genic CGI, gene bodies and H3K36me3-enriched regions
of the genome. Moreover DNA methylation differences were present in promoters
of upregulated genes but not down-regulated genes. The profile of
H3K36me3-enrichment itself was also significantly different in protein-coding
regions of the genome. Conclusions. Distinct epigenomic patterns exist in
important DNA elements of the human cardiac genome in end-stage
cardiomyopathy. If epigenomic patterns track with disease progression, assays
for the epigenome may be more useful than quantification of mRNA for
assessing prognosis in heart failure. These results open up an important new
horizon of research and further studies will be needed to determine how
epigenomics contribute to altered gene expression in cardiomyopathy.

Cornelia Schone, Anne Venner, David A. Knowles, Mahesh M Karnani, and Denis
Burdakov.
**Dichotomous
cellular properties of mouse orexin/hypocretin neurons**.
*The Journal of Physiology*, 2011.

** Abstract:**
Hypothalamic hypocretin/orexin (hcrt/orx) neurons recently emerged as
critical regulators of sleep-wake cycles, reward-seeking, and body energy
balance. However, at the level of cellular and network properties, it remains
unclear whether hcrt/orx neurons are one homogenous population, or whether
there are several distinct types of hcrt/orx cells. Here, we collated diverse
structural and functional information about individual hcrt/orx neurons in
mouse brain slices, by combining patch-clamp analysis of spike firing,
membrane currents, and synaptic inputs with confocal imaging of cell shape
and subsequent 3-dimensional Sholl analysis of dendritic architecture.
Statistical cluster analysis of intrinsic firing properties revealed that
hcrt/orx neurons fall into two distinct types. These two cell types also
differ in the complexity of their dendritic arbour, the strength of AMPA and
GABAA receptor-mediated synaptic drive that they receive, and the density of
low-threshold, 4-aminopyridine-sensitive, transient K+ current. Our results
provide quantitative evidence that, at the cellular level, the mouse hcrt/orx
system is composed of two classes of neurons with different firing
properties, morphologies, and synaptic input organization.

Daniel Glass, Leopold Parts, David A. Knowles, Abraham Aviv, and Tim D.
Spector.
**No correlation between childhood maltreatment and telomere length**.
*Biological Psychiatry*, 68(6):21-22, 2010.

** Abstract:** Telomeres are lengths of repetitive DNA that cap the ends of
chromosomes. They protect the ends of the chromosome and shorten with each
cell division. Short leukocyte telomere length has been related to a number
of age-related diseases. In addition, shorter telomere length has been
associated with environmental factors such as smoking and lack of exercise.
In a recent issue of Biological Psychiatry, Tyrka et al. (4) published a
report suggesting a link between maltreatment in childhood and telomere
shortening in 31 subjects. Individuals who had suffered maltreatment had
telomere length 0.70 ± 0.24 compared with 1.02 ± 0.52 in individuals who had
not been abused.

David A. Knowles, Leopold Parts, Daniel Glass, and John M. Winn.
**Modeling skin
and ageing phenotypes using latent variable models in Infer.NET**.
In *NIPS Workshop: Predictive Models in Personalized Medicine Workshop*,
2010.

** Abstract:** We demonstrate and compare three
unsupervised Bayesian latent variable models implemented in Infer.NET for
biomedical data modeling of 42 skin and ageing phenotypes measured on the
12,000 female twins in the Twins UK study. We address various data modeling
problems, including high missingness, heterogeneous data, and repeat
observations. We compare the proposed models in terms of their performance at
predicting disease labels and symptoms from available explanatory variables,
concluding that factor analysis type models have the strongest statistical
performance in this setting. We show that such models can be combined with
regression components for improved interpretability.

** Comment:** web site

Finale Doshi-Velez, David Knowles, Shakir Mohamed, and Zoubin Ghahramani.
**Large scale
non-parametric inference: Data parallelisation in the Indian buffet
process**.
In *Advances in Neural Information Processing Systems 23*, pages
1294-1302, Cambridge, MA, USA, December 2009. The MIT Press.

** Abstract:** Nonparametric Bayesian models provide a framework
for flexible probabilistic modelling of complex datasets. Unfortunately, the
high-dimensional averages required for Bayesian methods can be slow,
especially with the unbounded representations used by nonparametric models.
We address the challenge of scaling Bayesian inference to the increasingly
large datasets found in real-world applications. We focus on parallelisation
of inference in the Indian Buffet Process (IBP), which allows data points to
have an unbounded number of sparse latent features. Our novel MCMC sampler
divides a large data set between multiple processors and uses message passing
to compute the global likelihoods and posteriors. This algorithm, the first
parallel inference scheme for IBP-based models, scales to datasets orders of
magnitude larger than have previously been possible.

David A. Knowles and Susan Holmes.
**Statistical tools
for ultra-deep pyrosequencing of fast evolving viruses**.
In *NIPS Workshop: Computational Biology*, 2009.

** Abstract:** We aim to detect minor variant Hepatitis B viruses (HBV) in 38
pyrosequencing samples from infected individuals. Errors involved in the
amplification and ultra deep pyrosequencing (UDPS) of these samples are
characterised using HBV plasmid controls. Homopolymeric regions and quality
scores are found to be significant covariates in determining insertion and
deletion (indel) error rates, but not mismatch rates which depend on the
nucleotide transition matrix. This knowledge is used to derive two methods
for classifying genuine mutations: a hypothesis testing framework and a
mixture model. Using an approximate "ground truth" from a limiting dilution
Sanger sequencing run, these methods are shown to outperform the naive
percentage threshold approach. The possibility of early stage PCR errors
becoming significant is investigated by simulation, which underlines the
importance of the initial copy number.

** Comment:** web site

David Knowles and Zoubin Ghahramani.
**Infinite sparse
factor analysis and infinite independent components analysis**.
In *7th International Conference on Independent Component Analysis and
Signal Separation*, pages 381-388, London, UK, September 2007. Springer,
doi
10.1007/978-3-540-74494-8_48.

** Abstract:** A
nonparametric Bayesian extension of Independent Components Analysis (ICA) is
proposed where observed data Y is modelled as a linear superposition, G, of a
potentially infinite number of hidden sources, X. Whether a given source is
active for a specific data point is specified by an infinite binary matrix,
Z. The resulting sparse representation allows increased data reduction
compared to standard ICA. We define a prior on Z using the Indian Buffet
Process (IBP). We describe four variants of the model, with Gaussian or
Laplacian priors on X and the one or two-parameter IBPs. We demonstrate
Bayesian inference under these models using a Markov chain Monte Carlo (MCMC)
algorithm on synthetic and gene expression data and compare to standard ICA
algorithms.

Vincent Fortuin, Mark Collier, Florian Wenzel, James Urquhart Allingham,
Jeremiah Zhe Liu, Dustin Tran, Balaji Lakshminarayanan, Jesse Berent,
Rodolphe Jenatton, and Effrosyni Kokiopoulou.
**Deep classifiers with
label noise modeling and distance awareness**.
*Transactions on Machine Learning Research*, 2022.

** Abstract:** Uncertainty estimation in deep learning has recently emerged as
a crucial area of interest to advance reliability and robustness in
safety-critical applications. While there have been many proposed methods
that either focus on distance-aware model uncertainties for
out-of-distribution detection or on input-dependent label uncertainties for
in-distribution calibration, both of these types of uncertainty are often
necessary. In this work, we propose the HetSNGP method for jointly modeling
the model and data uncertainty. We show that our proposed model affords a
favorable combination between these two types of uncertainty and thus
outperforms the baseline methods on some challenging out-of-distribution
datasets, including CIFAR-100C, ImageNet-C, and ImageNet-A. Moreover, we
propose HetSNGP Ensemble, an ensembled version of our method which
additionally models uncertainty over the network parameters and outperforms
other ensemble baselines.

** Comment:** Code

Manon Kok and Arno Solin.
**Scalable magnetic field SLAM in
3D using Gaussian process maps**.
In *Proceedings of the 21st International Conference on Information Fusion
(accepted for publication)*, Cambridge, UK, July 2018.

** Abstract:** We present a method for scalable and fully 3D magnetic field
simultaneous localisation and mapping (SLAM) using local anomalies in the
magnetic field as a source of position information. These anomalies are due
to the presence of ferromagnetic material in the structure of buildings and
in objects such as furniture. We represent the magnetic field map using a
Gaussian process model and take well-known physical properties of the
magnetic field into account. We build local magnetic field maps using
three-dimensional hexagonal block tiling. To make our approach
computationally tractable we use reduced-rank Gaussian process regression in
combination with a Rao-Blackwellised particle filter. We show that it is
possible to obtain accurate position and orientation estimates using
measurements from a smartphone, and that our approach provides a scalable
magnetic SLAM algorithm in terms of both computational complexity and map
storage.

Martin A. Skoglund, Zoran Sjanic, and Manon Kok.
**On
orientation estimation using iterative methods in Euclidean space**.
In *Proceedings of the 20th International Conference on Information
Fusion*, Xi'an, China, July 2017. doi
10.23919/ICIF.2017.8009830.

** Abstract:** This paper
presents three iterative methods for orientation estimation. The first two
are based on iterated Extended Kalman filter (IEKF) formulations with
different state representations. The first is using the well-known unit
quaternion as state (q-IEKF) while the other is using orientation deviation
which we call IMEKF. The third method is based on nonlinear least squares
(NLS) estimation of the angular velocity which is used to parametrise the
orientation. The results are obtained using Monte Carlo simulations and the
comparison is done with the non-iterative EKF and multiplicative EKF (MEKF)
as baseline. The result clearly shows that the IMEKF and the NLS-based method
are superior to q-IEKF and all three outperform the non-iterative
methods.

Manon Kok, Jeroen D. Hol, and Thomas B. Schön.
**Using
inertial sensors for position and orientation estimation**.
*Foundations and Trends in Signal Processing*, 11(1-2):1-153, 2017.

** Abstract:** In recent years, MEMS inertial sensors (3D
accelerometers and 3D gyroscopes) have become widely available due to their
small size and low cost. Inertial sensor measurements are obtained at high
sampling rates and can be integrated to obtain position and orientation
information. These estimates are accurate on a short time scale, but suffer
from integration drift over longer time scales. To overcome this issue,
inertial sensors are typically combined with additional sensors and models.
In this tutorial we focus on the signal processing aspects of position and
orientation estimation using inertial sensors. We discuss different modeling
choices and a selected number of important algorithms. The algorithms include
optimization-based smoothing and filtering as well as computationally cheaper
extended Kalman filter and complementary filter implementations. The quality
of their estimates is illustrated using both experimental and simulated
data.

** Comment:** arXiv

Joar Skalse, Nikolaus HR Howe, Dmitrii Krasheninnikov, and David Krueger.
**Defining and characterizing
reward hacking**.
In *Advances in Neural Information Processing Systems 35*, 2022.

Lauro Langosco di Langosco, Jack Koch, Lee D Sharkey, Jacob Pfau, and David
Krueger.
**Goal misgeneralization in deep
reinforcement learning**.
In *39th International Conference on Machine Learning*, 2022.

Shoaib Ahmed Siddiqui, Nitarshan Rajkumar, Tegan Maharaj, David Krueger, and
Sara Hooker.
**Metadata archaeology: Unearthing
data subsets by leveraging training dynamics**.
*arXiv preprint arXiv:2209.10015*, 2022.

** Abstract:**
Modern machine learning research relies on relatively few carefully curated
datasets. Even in these datasets, and typically in 'untidy' or raw data,
practitioners are faced with significant issues of data quality and diversity
which can be prohibitively labor intensive to address. Existing methods for
dealing with these challenges tend to make strong assumptions about the
particular issues at play, and often require a priori knowledge or metadata
such as domain labels. Our work is orthogonal to these methods: we instead
focus on providing a unified and efficient framework for Metadata Archaeology
- uncovering and inferring metadata of examples in a dataset. We curate
different subsets of data that might exist in a dataset (e.g. mislabeled,
atypical, or out-of-distribution examples) using simple transformations, and
leverage differences in learning dynamics between these probe suites to infer
metadata of interest. Our method is on par with far more sophisticated
mitigation methods across different tasks: identifying and correcting
mislabeled examples, classifying minority-group samples, prioritizing points
relevant for training and enabling scalable human auditing of relevant
examples.

** Comment:** Project webpage: https://metadata-archaeology.github.io/

C. Eastwood, A. Robey, S. Singh, J. von Kügelgen, H. Hassani, G. J. Pappas,
and B. Schölkopf.
**Probable domain generalization
via quantile risk minimization**.
In *Advances in Neural Information Processing Systems 35*. Curran
Associates, Inc., 2022.

** Abstract:** Domain generalization
(DG) seeks predictors which perform well on unseen test distributions by
leveraging data drawn from multiple related training distributions or
domains. To achieve this, DG is commonly formulated as an average- or
worst-case problem over the set of possible domains. However, predictors that
perform well on average lack robustness while predictors that perform well in
the worst case tend to be overly conservative. To address this, we propose a
new probabilistic framework for DG where the goal is to learn predictors that
perform well with high probability. Our key idea is that distribution shifts
seen during training should inform us of probable shifts at test time, which
we realize by explicitly relating training and test domains as draws from the
same underlying meta-distribution. To achieve probable DG, we propose a new
optimization problem called Quantile Risk Minimization (QRM). By minimizing
the $α$-quantile of a predictor's risk distribution over domains, QRM
seeks predictors that perform well with probability $α$. To solve QRM in
practice, we propose the Empirical QRM (EQRM) algorithm, and prove: (i) a
generalization bound for EQRM; and (ii) that EQRM recovers the causal
predictor as $α \to 1$. In our experiments, we introduce a more holistic
quantile-focused evaluation protocol for DG, and demonstrate that EQRM
outperforms state-of-the-art baselines on CMNIST and several datasets from
WILDS and DomainBed.

L. Gresele*, J. von Kügelgen*, J. M. Kübler*, E. Kirschbaum,
B. Schölkopf, and D. Janzing.
**Causal
inference through the structural causal marginal problem**.
In *39th International Conference on Machine Learning*, volume 162,
pages 7793-7824. PMLR, 2022.
*equal contribution.

** Abstract:** We introduce an approach to
counterfactual inference based on merging information from multiple datasets.
We consider a causal reformulation of the statistical marginal problem: given
a collection of marginal structural causal models (SCMs) over distinct but
overlapping sets of variables, determine the set of joint SCMs that are
counterfactually consistent with the marginal ones. We formalise this
approach for categorical SCMs using the response function formulation and
show that it reduces the space of allowed marginal and joint SCMs. Our work
thus highlights a new mode of falsifiability through additional variables, in
contrast to the statistical one via additional data.

F. Laumann, J. von Kügelgen, T. H. Kanashiro Uehara, and M. Barahona.
**Complex
interlinkages, key objectives and nexuses amongst the sustainable development
goals and climate change: a network analysis**.
*The Lancet Planetary Health*, 6(5):E422-E430, 2022, doi
10.1016/S2542-5196(22)00070-5.

** Abstract:** Background:
Global sustainability is an enmeshed system of complex socioeconomic,
climatological, and ecological interactions. The numerous objectives of the
UN’s Sustainable Development Goals (SDGs) and the Paris Agreement have
various levels of interdependence, making it difficult to ascertain the
influence of changes to particular indicators across the whole system. In
this analysis, we aimed to detect and rank the complex interlinkages between
objectives of sustainability agendas. Methods: We developed a method to find
interlinkages among the 17 SDGs and climate change, including non-linear and
non-monotonic dependences. We used time series of indicators defined by the
World Bank, consisting of 400 indicators that measure progress towards the 17
SDGs and an 18th variable (annual average temperatures), representing
progress in the response to the climate crisis, from 2000 to 2019. This
method detects significant dependencies among the time evolution of the
objectives by using partial distance correlations, a non-linear measure of
conditional dependence that also discounts spurious correlations originating
from lurking variables. We then used a network representation to identify the
most important objectives (using network centrality) and to obtain nexuses of
objectives (defined as highly interconnected clusters in the network).
Findings: Using temporal data from 181 countries spanning 20 years, we
analysed dependencies among SDGs and climate for 35 country groupings based
on region, development, and income level. The observed significant
interlinkages, central objectives, and nexuses identified varied greatly
across country groupings; however, SDG 17 (partnerships for the goals) and
climate change ranked as highly important across many country groupings.
Temperature rise was strongly linked to urbanisation, air pollution, and slum
expansion (SDG 11), especially in country groupings likely to be worst
affected by climate breakdown, such as Africa. In several country groupings
composed of developing nations, we observed a consistent nexus of strongly
interconnected objectives formed by SDG 1 (poverty reduction), SDG 4
(education), and SDG 8 (economic growth), sometimes incorporating SDG 5
(gender equality), and SDG 16 (peace and justice). Interpretation: The
differences across groupings emphasise the need to define goals in accordance
with local circumstances and priorities. Our analysis highlights global
partnerships (SDG 17) as a pivot in global sustainability efforts, which have
been strongly linked to economic growth (SDG 8). However, if economic growth
and trade expansion were repositioned as a means instead of an end goal of
development, our analysis showed that education (SDG 4) and poverty reduction
(SDG 1) become more central, thus suggesting that these could be prioritised
in global partnerships. Urban livelihoods (SDG 11) were also flagged as
important to avoid replicating unsustainable patterns of the past.

O. Makansi, J. von Kügelgen, F. Locatello, P. Gehler, D. Janzing, T. Brox,
and B. Schölkopf.
**You mostly walk alone:
Analyzing feature attribution in trajectory prediction**.
In *10th International Conference on Learning Representations*, 2022.

** Abstract:** Predicting the future trajectory of a moving
agent can be easy when the past trajectory continues smoothly but is
challenging when complex interactions with other agents are involved. Recent
deep learning approaches for trajectory prediction show promising performance
and partially attribute this to successful reasoning about agent-agent
interactions. However, it remains unclear which features such black-box
models actually learn to use for making predictions. This paper proposes a
procedure that quantifies the contributions of different cues to model
performance based on a variant of Shapley values. Applying this procedure to
state-of-the-art trajectory prediction methods on standard benchmark datasets
shows that they are, in fact, unable to reason about interactions. Instead,
the past trajectory of the target is the only feature used for predicting its
future. For a task with richer social interaction patterns, on the other
hand, the tested models do pick up such interactions to a certain extent, as
quantified by our feature attribution method. We discuss the limits of the
proposed method and its links to causality.

R. Perry, J. von Kügelgen*, and B. Schölkopf*.
**Causal discovery in heterogeneous
environments under the sparse mechanism shift hypothesis**.
In *Advances in Neural Information Processing Systems 35*. Curran
Associates, Inc., 2022.
*shared last author.

** Abstract:** Machine learning approaches
commonly rely on the assumption of independent and identically distributed
(i.i.d.) data. In reality, however, this assumption is almost always violated
due to distribution shifts between environments. Although valuable learning
signals can be provided by heterogeneous data from changing distributions, it
is also known that learning under arbitrary (adversarial) changes is
impossible. Causality provides a useful framework for modeling distribution
shifts, since causal models encode both observational and interventional
distributions. In this work, we explore the sparse mechanism shift
hypothesis, which posits that distribution shifts occur due to a small number
of changing causal conditionals. Motivated by this idea, we apply it to
learning causal structure from heterogeneous environments, where i.i.d. data
only allows for learning an equivalence class of graphs without restrictive
assumptions. We propose the Mechanism Shift Score (MSS), a score-based
approach amenable to various empirical estimators, which provably identifies
the entire causal structure with high probability if the sparse mechanism
shift hypothesis holds. Empirically, we verify behavior predicted by the
theory and compare multiple estimators and score functions to identify the
best approaches in practice. Compared to other methods, we show how MSS
bridges a gap by both being nonparametric as well as explicitly leveraging
sparse changes.

P. Reizinger*, L. Gresele*, J. Brady*, J. von Kügelgen, D. Zietlow,
B. Schölkopf, G. Martius, W. Brendel, and M. Besserve.
**Embrace the gap: VAEs perform
independent mechanism analysis**.
In *Advances in Neural Information Processing Systems 35*. Curran
Associates, Inc., 2022.
*equal first authorship.

** Abstract:** Variational autoencoders
(VAEs) are a popular framework for modeling complex data distributions; they
can be efficiently trained via variational inference by maximizing the
evidence lower bound (ELBO), at the expense of a gap to the exact
(log-)marginal likelihood. While VAEs are commonly used for representation
learning, it is unclear why ELBO maximization would yield useful
representations, since unregularized maximum likelihood estimation cannot
invert the data-generating process. Yet, VAEs often succeed at this task. We
seek to elucidate this apparent paradox by studying nonlinear VAEs in the
limit of near-deterministic decoders. We first prove that, in this regime,
the optimal encoder approximately inverts the decoder - a commonly used but
unproven conjecture - which we refer to as *self-consistency*.
Leveraging self-consistency, we show that the ELBO converges to a regularized
log-likelihood. This allows VAEs to perform what has recently been termed
independent mechanism analysis (IMA): it adds an inductive bias towards
decoders with column-orthogonal Jacobians, which helps recover the true
latent factors. The gap between ELBO and log-likelihood is therefore welcome,
since it bears unanticipated benefits for nonlinear representation learning.
In experiments on synthetic and image data, we show that VAEs uncover the
true latent factors when the data generating process satisfies the IMA
assumption.

B. Schölkopf* and J. von Kügelgen*.
**From statistical to causal
learning**.
In *Proceedings of the International Congress of Mathematicians (ICM)*.
EMS Press, 2022.
*equal contribution.

** Abstract:** We describe basic ideas
underlying research to build and understand artificially intelligent systems:
from symbolic approaches via statistical learning to interventional models
relying on concepts of causality. Some of the hard open problems of machine
learning and AI are intrinsically related to causality, and progress may
require advances in our understanding of how to model and infer causality
from data.

L. Schott, J. von Kügelgen, F. Träuble, P. Gehler, C. Russell,
M. Bethge, B. Schölkopf, F. Locatello, and W. Brendel.
**Visual representation
learning does not generalize strongly within the same domain**.
In *10th International Conference on Learning Representations*, 2022.

** Abstract:** An important component for generalization in
machine learning is to uncover underlying latent factors of variation as well
as the mechanism through which each factor acts in the world. In this paper,
we test whether 17 unsupervised, weakly supervised, and fully supervised
representation learning approaches correctly infer the generative factors of
variation in simple datasets (dSprites, Shapes3D, MPI3D) from controlled
environments, and on our contributed CelebGlow dataset. In contrast to prior
robustness work that introduces novel factors of variation during test time,
such as blur or other (un)structured noise, we here recompose, interpolate,
or extrapolate only existing factors of variation from the training data set
(e.g., small and medium-sized objects during training and large objects
during testing). Models that learn the correct mechanism should be able to
generalize to this benchmark. In total, we train and test 2000+ models and
observe that all of them struggle to learn the underlying mechanism
regardless of supervision signal and architectural bias. Moreover, the
generalization capabilities of all tested models drop significantly as we
move from artificial datasets towards more realistic real-world datasets.
Despite their inability to identify the correct mechanism, the models are
quite modular as their ability to infer other in-distribution factors remains
fairly stable, providing only a single factor is out-of-distribution. These
results point to an important yet understudied problem of learning
mechanistic models of observations that can facilitate generalization.

C. Toth, L. Lorch, C. Knoll, A. Krause, F. Pernkopf, R. Peharz*, and J. von
Kügelgen*.
**Active Bayesian causal
inference**.
In *Advances in Neural Information Processing Systems 35*. Curran
Associates, Inc., 2022.
*shared last author.

** Abstract:** Causal discovery and causal
reasoning are classically treated as separate and consecutive tasks: one
first infers the causal graph, and then uses it to estimate causal effects of
interventions. However, such a two-stage approach is uneconomical, especially
in terms of actively collected interventional data, since the causal query of
interest may not require a fully-specified causal model. From a Bayesian
perspective, it is also unnatural, since a causal query (e.g., the causal
graph or some causal effect) can be viewed as a latent quantity subject to
posterior inference - other unobserved quantities that are not of direct
interest (e.g., the full causal model) ought to be marginalized out in this
process and contribute to our epistemic uncertainty. In this work, we propose
Active Bayesian Causal Inference (ABCI), a fully-Bayesian active learning
framework for integrated causal discovery and reasoning, which jointly infers
a posterior over causal models and queries of interest. In our approach to
ABCI, we focus on the class of causally-sufficient, nonlinear additive noise
models, which we model using Gaussian processes. We sequentially design
experiments that are maximally informative about our target causal query,
collect the corresponding interventional data, and update our beliefs to
choose the next experiment. Through simulations, we demonstrate that our
approach is more data-efficient than several baselines that only focus on
learning the full causal graph. This allows us to accurately learn downstream
causal queries from fewer samples while providing well-calibrated uncertainty
estimates for the quantities of interest.

J. von Kügelgen, A.-H. Karimi, U. Bhatt, I. Valera, A. Weller, and
B. Schölkopf.
**On the fairness of causal
algorithmic recourse**.
In *Proceedings of the 36th AAAI Conference on Artificial Intelligence
(AAAI)*, 2022.

** Abstract:** Algorithmic fairness is
typically studied from the perspective of predictions. Instead, here we
investigate fairness from the perspective of recourse actions suggested to
individuals to remedy an unfavourable classification. We propose two new
fairness criteria at the group and individual level, which - unlike prior
work on equalising the average group-wise distance from the decision boundary
- explicitly account for causal relationships between features, thereby
capturing downstream effects of recourse actions performed in the physical
world. We explore how our criteria relate to others, such as counterfactual
fairness, and show that fairness of recourse is complementary to fairness of
prediction. We study theoretically and empirically how to enforce fair causal
recourse by altering the classifier and perform a case study on the Adult
dataset. Finally, we discuss whether fairness violations in the data
generating process revealed by our criteria may be better addressed by
societal interventions as opposed to constraints on the classifier.

L. Gresele*, J. von Kügelgen*, V. Stimper, B. Schölkopf, and
M. Besserve.
**Independent
mechanism analysis, a new concept?**.
In *Advances in Neural Information Processing Systems 34*, pages
28233-28248. Curran Associates, Inc., 2021.
*equal contribution.

** Abstract:** Independent component
analysis provides a principled framework for unsupervised representation
learning, with solid theory on the identifiability of the latent code that
generated the data, given only observations of mixtures thereof.
Unfortunately, when the mixing is nonlinear, the model is provably
nonidentifiable, since statistical independence alone does not sufficiently
constrain the problem. Identifiability can be recovered in settings where
additional, typically observed variables are included in the generative
process. We investigate an alternative path and consider instead including
assumptions reflecting the principle of independent causal mechanisms
exploited in the field of causality. Specifically, our approach is motivated
by thinking of each source as independently influencing the mixing process.
This gives rise to a framework which we term independent mechanism analysis.
We provide theoretical and empirical evidence that our approach circumvents a
number of nonidentifiability issues arising in nonlinear blind source
separation.

Z. Jin*, J. von Kügelgen*, J. Ni, T. Vaidhya, A. Kaushal, M. Sachan, and
B. Schölkopf.
**Causal direction of
data collection matters: Implications of causal and anticausal learning for
NLP**.
In *Proceedings of the 2021 Conference on Empirical Methods in Natural
Language Processing (EMNLP)*, pages 9499-9513. Association for
Computational Linguistics, 2021, doi
10.18653/v1/2021.emnlp-main.748.
*equal contribution.

** Abstract:** The principle of independent
causal mechanisms (ICM) states that generative processes of real world data
consist of independent modules which do not influence or inform each other.
While this idea has led to fruitful developments in the field of causal
inference, it is not widely known in the NLP community. In this work, we
argue that the causal direction of the data collection process bears
nontrivial implications that can explain a number of published NLP findings,
such as differences in semi-supervised learning (SSL) and domain adaptation
(DA) performance across different settings. We categorize common NLP tasks
according to their causal direction and empirically assay the validity of the
ICM principle for text data using minimum description length. We conduct an
extensive meta-analysis of over 100 published SSL and 30 DA studies, and find
that the results are consistent with our expectations based on causal
insights. This work presents the first attempt to analyze the ICM principle
in NLP, and provides constructive suggestions for future modeling
choices.

M. Tangemann, S. Schneider, J. von Kügelgen, F. Locatello, P. Gehler,
T. Brox, M. Kümmerer, M. Bethge, and B. Schölkopf.
**Unsupervised object learning via
common fate**, 2021.

** Abstract:** Learning generative
object models from unlabelled videos is a long-standing problem and is required
for causal scene modeling. We decompose this problem into three easier
subtasks, and provide candidate solutions for each of them. Inspired by the
Common Fate Principle of Gestalt Psychology, we first extract (noisy) masks
of moving objects via unsupervised motion segmentation. Second, generative
models are trained on the masks of the background and the moving objects,
respectively. Third, background and foreground models are combined in a
conditional "dead leaves" scene model to sample novel scene configurations
where occlusions and depth layering arise naturally. To evaluate the
individual stages, we introduce the Fishbowl dataset positioned between
complex real-world scenes and common object-centric benchmarks of simplistic
objects. We show that our approach allows learning generative models that
generalize beyond the occlusions present in the input videos, and represent
scenes in a modular fashion that allows sampling plausible scenes outside the
training distribution by permitting, for instance, object numbers or
densities not observed in the training set.

F. Träuble, J. von Kügelgen, M. Kleindessner, F. Locatello,
B. Schölkopf, and P. Gehler.
**Backward-compatible
prediction updates: A probabilistic approach**.
In *Advances in Neural Information Processing Systems 34*, pages
116-128. Curran Associates, Inc., 2021.

** Abstract:** When
machine learning systems meet real world applications, accuracy is only one
of several requirements. In this paper, we assay a complementary perspective
originating from the increasing availability of pre-trained and regularly
improving state-of-the-art models. While new improved models develop at a
fast pace, downstream tasks vary more slowly or stay constant. Assume that we
have a large unlabelled data set for which we want to maintain accurate
predictions. Whenever a new and presumably better ML model becomes
available, we encounter two problems: (i) given a limited budget, which data
points should be re-evaluated using the new model?; and (ii) if the new
predictions differ from the current ones, should we update? Problem (i) is
about compute cost, which matters for very large data sets and models.
Problem (ii) is about maintaining consistency of the predictions, which can
be highly relevant for downstream applications; our demand is to avoid
negative flips, i.e., changing correct to incorrect predictions. In this
paper, we formalize the Prediction Update Problem and present an efficient
probabilistic approach as an answer to the above questions. In extensive
experiments on standard classification benchmark data sets, we show that our
method outperforms alternative strategies along key metrics for
backward-compatible prediction updates.

J. von Kügelgen*, L. Gresele*, and B. Schölkopf.
**Simpson's
paradox in COVID-19 case fatality rates: a mediation analysis of age-related
causal effects**.
*IEEE Transactions on Artificial Intelligence*, 2(1):18-27, 2021, doi
10.1109/TAI.2021.3073088.
*equal contribution.

** Abstract:** We point out an
instantiation of Simpson's paradox in COVID-19 case fatality rates (CFRs):
comparing a large-scale study from China (February 17) with early reports
from Italy (March 9), we find that CFRs are lower in Italy for every age
group, but higher overall. This phenomenon is explained by a stark difference
in case demographic between the two countries. Using this as a motivating
example, we introduce basic concepts from mediation analysis and show how
these can be used to quantify different direct and indirect effects when
assuming a coarse-grained causal graph involving country, age, and case
fatality. We curate an age-stratified CFR dataset with >750k cases and
conduct a case study, investigating total, direct, and indirect
(age-mediated) causal effects between different countries and at different
points in time. This allows us to separate age-related effects from others
unrelated to age and facilitates a more transparent comparison of CFRs across
countries at different stages of the COVID-19 pandemic. Using longitudinal
data from Italy, we discover a sign reversal of the direct causal effect in
mid-March, which temporally aligns with the reported collapse of the
healthcare system in parts of the country. Moreover, we find that direct and
indirect effects across 132 pairs of countries are only weakly correlated,
suggesting that a country's policy and case demographic may be largely
unrelated. We point out limitations and extensions for future work, and
finally, discuss the role of causal reasoning in the broader context of using
AI to combat the COVID-19 pandemic.

J. von Kügelgen*, Y. Sharma*, L. Gresele*, W. Brendel, B. Schölkopf,
M. Besserve, and F. Locatello.
**Self-supervised
learning with data augmentations provably isolates content from
style**.
In *Advances in Neural Information Processing Systems 34*, pages
16451-16467. Curran Associates, Inc., 2021.
*equal contribution.

** Abstract:** Self-supervised
representation learning has shown remarkable success in a number of domains.
A common practice is to perform data augmentation via hand-crafted
transformations intended to leave the semantics of the data invariant. We
seek to understand the empirical success of this approach from a theoretical
perspective. We formulate the augmentation process as a latent variable model
by postulating a partition of the latent representation into a content
component, which is assumed invariant to augmentation, and a style component,
which is allowed to change. Unlike prior work on disentanglement and
independent component analysis, we allow for both nontrivial statistical and
causal dependencies in the latent space. We study the identifiability of the
latent representation based on pairs of views of the observations and prove
sufficient conditions that allow us to identify the invariant content
partition up to an invertible mapping in both generative and discriminative
settings. We find numerical simulations with dependent latent variables are
consistent with our theory. Lastly, we introduce Causal3DIdent, a dataset of
high-dimensional, visually complex images with rich causal dependencies,
which we use to study the effect of data augmentations performed in
practice.

A.-H. Karimi*, J. von Kügelgen*, B. Schölkopf, and I. Valera.
**Algorithmic
recourse under imperfect causal knowledge: a probabilistic approach**.
In *Advances in Neural Information Processing Systems 33*, pages
265-277. Curran Associates, Inc., 2020.
*equal contribution.

** Abstract:** Recent work has discussed
the limitations of counterfactual explanations to recommend actions for
algorithmic recourse, and argued for the need of taking causal relationships
between features into consideration. Unfortunately, in practice, the true
underlying structural causal model is generally unknown. In this work, we
first show that it is impossible to guarantee recourse without access to the
true structural equations. To address this limitation, we propose two
probabilistic approaches to select optimal actions that achieve recourse with
high probability given limited causal knowledge (e.g., only the causal
graph). The first captures uncertainty over structural equations under
additive Gaussian noise, and uses Bayesian model averaging to estimate the
counterfactual distribution. The second removes any assumptions on the
structural equations by instead computing the average effect of recourse
actions on individuals similar to the person who seeks recourse, leading to a
novel subpopulation-based interventional notion of recourse. We then derive a
gradient-based procedure for selecting optimal recourse actions, and
empirically show that the proposed approaches lead to more reliable
recommendations under imperfect causal knowledge than non-probabilistic
baselines.

J. von Kügelgen, A. Mey, M. Loog, and B. Schölkopf.
**Semi-supervised
learning, causality, and the conditional cluster assumption**.
In *Proceedings of the 36th International Conference on Uncertainty in
Artificial Intelligence (UAI)*, volume 124 of *Proceedings of Machine
Learning Research*, pages 1-10. PMLR, 2020.
*also at NeurIPS 2019 Workshop Do the right thing: machine learning and causal
inference for improved decision making.

** Abstract:** While
the success of semi-supervised learning (SSL) is still not fully understood,
Schölkopf et al. (2012) have established a link to the principle of
independent causal mechanisms. They conclude that SSL should be impossible
when predicting a target variable from its causes, but possible when
predicting it from its effects. Since both these cases are restrictive, we
extend their work by considering classification using cause and effect
features at the same time, such as predicting a disease from both risk
factors and symptoms. While standard SSL exploits information contained in
the marginal distribution of all inputs (to improve the estimate of the
conditional distribution of the target given inputs), we argue that in our
more general setting we should use information in the conditional
distribution of effect features given causal features. We explore how this
insight generalises the previous understanding, and how it relates to and can
be exploited algorithmically for SSL.

J. von Kügelgen*, I. Ustyuzhaninov*, P. Gehler, M. Bethge, and
B. Schölkopf.
**Towards causal generative scene
models via competition of experts**.
In *ICLR 2020 Workshop "Causal Learning for Decision Making"*, 2020.
*equal contribution.

** Abstract:** Learning how to model
complex scenes in a modular way with recombinable components is a
pre-requisite for higher-order reasoning and acting in the physical world.
However, current generative models lack the ability to capture the inherently
compositional and layered nature of visual scenes. While recent work has made
progress towards unsupervised learning of object-based scene representations,
most models still maintain a global representation space (i.e., objects are
not explicitly separated), and cannot generate scenes with novel object
arrangement and depth ordering. Here, we present an alternative approach
which uses an inductive bias encouraging modularity by training an ensemble
of generative models (experts). During training, experts compete for
explaining parts of a scene, and thus specialise on different object classes,
with objects being identified as parts that re-occur across multiple scenes.
Our model allows for controllable sampling of individual objects and
recombination of experts in physically plausible ways. In contrast to other
methods, depth layering and occlusion are handled correctly, moving this
approach closer to a causal generative scene model. Experiments on simple toy
data qualitatively demonstrate the conceptual advantages of the proposed
approach.

J. von Kügelgen, P. K. Rubenstein, B. Schölkopf, and A. Weller.
**Optimal experimental design via
Bayesian optimization: active causal structure learning for Gaussian process
networks**.
In *NeurIPS 2019 Workshop Do the right thing: machine learning and causal
inference for improved decision making*, December 2019.

**
Abstract:** We study the problem of causal discovery through targeted
interventions. Starting from few observational measurements, we follow a
Bayesian active learning approach to perform those experiments which, in
expectation with respect to the current model, are maximally informative
about the underlying causal structure. Unlike previous work, we consider the
setting of continuous random variables with non-linear functional
relationships, modelled with Gaussian process priors. To address the arising
problem of choosing from an uncountable set of possible interventions, we
propose to use Bayesian optimisation to efficiently maximise a Monte Carlo
estimate of the expected information gain.

J. von Kügelgen, A. Mey, and M. Loog.
**Semi-generative
modelling: Covariate-shift adaptation with cause and effect features**.
In *22nd International Conference on Artificial Intelligence and
Statistics*, volume 89, pages 1361-1369. PMLR, 2019.

** Abstract:** Current methods for covariate-shift adaptation
use unlabelled data to compute importance weights or domain-invariant
features, while the final model is trained on labelled data only. Here, we
consider a particular case of covariate shift which allows us also to learn
from unlabelled data, that is, combining adaptation with semi-supervised
learning. Using ideas from causality, we argue that this requires learning
with both causes, $X_C$, and effects, $X_E$, of a target variable, $Y$, and
show how this setting leads to what we call a semi-generative model,
$P(Y,X_E|X_C,θ)$. Our approach is robust to domain shifts in the
distribution of causal features and leverages unlabelled data by learning a
direct map from causes to effects. Experiments on synthetic data demonstrate
significant improvements in classification over purely-supervised and
importance-weighting baselines.

Simon Lacoste-Julien, Konstantina Palla, Alex Davies, Gjergji Kasneci, Thore
Graepel, and Zoubin Ghahramani.
**SiGMa: simple
greedy matching for aligning large knowledge bases**.
In *KDD*, pages 572-580. Association for Computing Machinery, 2013.

** Abstract:** The Internet has enabled the creation of a
growing number of large-scale knowledge bases in a variety of domains
containing complementary information. Tools for automatically aligning these
knowledge bases would make it possible to unify many sources of structured
knowledge and answer complex queries. However, the efficient alignment of
large-scale knowledge bases still poses a considerable challenge. Here, we
present Simple Greedy Matching (SiGMa), a simple algorithm for aligning
knowledge bases with millions of entities and facts. SiGMa is an iterative
propagation algorithm which leverages both the structural information from
the relationship graph as well as flexible similarity measures between entity
properties in a greedy local search, thus making it scalable. Despite its
greedy nature, our experiments indicate that SiGMa can efficiently match some
of the world's largest knowledge bases with high precision. We provide
additional experiments on benchmark datasets which demonstrate that SiGMa can
outperform state-of-the-art approaches both in accuracy and efficiency.

Simon Lacoste-Julien, Ferenc Huszár, and Zoubin Ghahramani.
**Approximate
inference for the loss-calibrated Bayesian**.
In Geoff Gordon and David Dunson, editors, *14th International Conference on
Artificial Intelligence and Statistics*, volume 15, pages 416-424,
Fort Lauderdale, FL, USA, April 2011. Journal of Machine Learning Research.

** Abstract:** We consider the problem of approximate inference
in the context of Bayesian decision theory. Traditional approaches focus on
approximating general properties of the posterior, ignoring the decision task
- and associated losses - for which the posterior could be used. We argue
that this can be suboptimal and propose instead to *loss-calibrate* the
approximate inference methods with respect to the decision task at hand. We
present a general framework rooted in Bayesian decision theory to analyze
approximate inference from the perspective of losses, opening up several
research directions. As a first loss-calibrated approximate inference
attempt, we propose an EM-like algorithm on the Bayesian posterior risk and
show how it can improve a standard approach to Gaussian process
classification when losses are asymmetric.

Ferenc Huszár and Simon Lacoste-Julien.
**A kernel approach to
tractable Bayesian nonparametrics**.
Technical report, University of Cambridge, 2011.

** Abstract:**
Inference in popular nonparametric Bayesian models typically relies on
sampling or other approximations. This paper presents a general methodology
for constructing novel tractable nonparametric Bayesian methods by applying
the kernel trick to inference in a parametric Bayesian model. For example,
Gaussian process regression can be derived this way from Bayesian linear
regression. Despite the success of the Gaussian process framework, the kernel
trick is rarely explicitly considered in the Bayesian literature. In this
paper, we aim to fill this gap and demonstrate the potential of applying the
kernel trick to tractable Bayesian parametric models in a wider context than
just regression. As an example, we present an intuitive Bayesian kernel
machine for density estimation that is obtained by applying the kernel trick
to a Gaussian generative model in feature space.

** Comment:** arXiv:1103.1761

Vidhi Lalchand, Wessel P. Bruinsma, David R. Burt, and Carl E. Rasmussen.
**Sparse Gaussian
process hyperparameters: Optimize or integrate?**.
In *Advances in Neural Information Processing Systems*, 2022.

** Abstract:** The kernel function and
its hyperparameters are the central model selection choice in a Gaussian
process [Rasmussen and Williams, 2006]. Typically, the hyperparameters of the
kernel are chosen by maximising the marginal likelihood, an approach known as
Type-II maximum likelihood (ML-II). However, ML-II does not account for
hyperparameter uncertainty, and it is well-known that this can lead to
severely biased estimates and an underestimation of predictive uncertainty.
While there are several works which employ a fully Bayesian characterisation
of GPs, relatively few propose such approaches for the sparse GPs paradigm.
In this work we propose an algorithm for sparse Gaussian process regression
which leverages MCMC to sample from the hyperparameter posterior within the
variational inducing point framework of [Titsias, 2009]. This work is closely
related to Hensman et al. [2015b], but side-steps the need to sample the
inducing points, thereby significantly improving sampling efficiency in the
Gaussian likelihood case. We compare this scheme against natural baselines in
the literature and against stochastic variational GPs (SVGPs), and provide an
extensive computational analysis.

Vidhi Lalchand, Aditya Ravuri, and Neil D. Lawrence.
**Generalised
GPLVM with Stochastic Variational Inference**.
In *25th International Conference on Artificial Intelligence and
Statistics*, volume 151 of *Proceedings of Machine Learning
Research*, pages 7841-7864. PMLR, 28-30 Mar 2022.

**
Abstract:** Gaussian process latent variable models (GPLVM) are a flexible
and non-linear approach to dimensionality reduction, extending classical
Gaussian processes to an unsupervised learning context. The Bayesian
incarnation of the GPLVM uses a variational framework, where the posterior
over latent variables is approximated by a well-behaved variational family, a
factorised Gaussian yielding a tractable lower bound. However, the
non-factorisability of the lower bound prevents truly scalable inference. In
this work, we study the doubly stochastic formulation of the Bayesian GPLVM
model amenable to minibatch training. We show how this framework is
compatible with different latent variable formulations and perform
experiments to compare a suite of models. Further, we demonstrate how we can
train in the presence of massively missing data and obtain high-fidelity
reconstructions. We demonstrate the model’s performance by benchmarking
against the canonical sparse GPLVM for high dimensional data examples.

Vidhi Lalchand, Kenza Tazi, Talay M Cheema, Richard E Turner, and Scott
Hosking.
**Kernel learning for explainable
climate science**.
In *16th Bayesian Modelling Applications Workshop at UAI, 2022*, 2022.

** Abstract:** The Upper Indus Basin (UIB) in the Himalayas provides water
for 270 million people and countless ecosystems. However, precipitation, a
key component to hydrological modelling, is poorly understood in this area. A
key challenge surrounding this uncertainty comes from the complex
spatial-temporal distribution of precipitation across the basin. In this work
we propose Gaussian processes with structured non-stationary kernels to model
precipitation patterns in the UIB. Previous attempts to quantify or model
precipitation in the Hindu Kush Karakoram Himalayan region have often been
qualitative or include crude assumptions and simplifications which cannot be
resolved at lower resolutions. This body of research also provides little to
no error propagation. We account for the spatial variation in precipitation
with a non-stationary Gibbs kernel parameterised with an input dependent
lengthscale. This allows the posterior function samples to adapt to the
varying precipitation patterns inherent in the distinct underlying topography
of the Indus region. The input dependent lengthscale is governed by a latent
Gaussian process with a stationary squared-exponential kernel to allow the
function level hyperparameters to vary smoothly. In ablation experiments we
motivate each component of the proposed kernel by demonstrating its ability
to model the spatial covariance, temporal structure and joint spatio-temporal
reconstruction. We benchmark our model against a stationary Gaussian process and
a deep Gaussian process.

Fergus Simpson, Ian Davies, Vidhi Lalchand, Alessandro Vullo, Nicolas Durrande,
and Carl Edward Rasmussen.
**Kernel
identification through transformers**.
In *Advances in Neural Information Processing Systems 34*,
volume 34, pages 10483-10495, 2021.

** Abstract:**
Kernel selection plays a central role in determining the performance of
Gaussian Process (GP) models, as the chosen kernel determines both the
inductive biases and prior support of functions under the GP prior. This work
addresses the challenge of constructing custom kernel functions for
high-dimensional GP regression models. Drawing inspiration from recent
progress in deep learning, we introduce a novel approach named KITT: Kernel
Identification Through Transformers. KITT exploits a transformer-based
architecture to generate kernel recommendations in under 0.1 seconds, which
is several orders of magnitude faster than conventional kernel search
algorithms. We train our model using synthetic data generated from priors
over a vocabulary of known kernels. By exploiting the nature of the
self-attention mechanism, KITT is able to process datasets with inputs of
arbitrary dimension. We demonstrate that kernels chosen by KITT yield strong
performance over a diverse collection of regression benchmarks.

Fergus Simpson, Vidhi Lalchand, and Carl Edward Rasmussen.
**Marginalised
Gaussian Processes with Nested Sampling**.
In *Advances in Neural Information Processing Systems 34*,
volume 34, pages 13613-13625. Curran Associates, Inc., 2021.

** Abstract:** Gaussian Process models are a rich distribution
over functions with inductive biases controlled by a kernel function.
Learning occurs through optimisation of the kernel hyperparameters using the
marginal likelihood as the objective. This work proposes nested sampling as a
means of marginalising kernel hyperparameters, because it is a technique that
is well-suited to exploring complex, multi-modal distributions. We benchmark
against Hamiltonian Monte Carlo on time-series and two-dimensional regression
tasks, finding that a principled approach to quantifying hyperparameter
uncertainty substantially improves the quality of prediction intervals.

Vidhi Lalchand and Carl Edward Rasmussen.
**Approximate
inference for fully Bayesian Gaussian process regression**.
In *2nd Symposium on Advances in Approximate Bayesian Inference*, pages
1-12. PMLR, 2020.

** Abstract:** Learning in Gaussian Process
models occurs through the adaptation of hyperparameters of the mean and the
covariance function. The classical approach entails maximizing the marginal
likelihood yielding fixed point estimates (an approach called Type II maximum
likelihood or ML-II). An alternative learning procedure is to infer the
posterior over hyperparameters in a hierarchical specification of GPs we call
Fully Bayesian Gaussian Process Regression (GPR). This work considers two
approximation schemes for the intractable hyperparameter posterior: 1)
Hamiltonian Monte Carlo (HMC) yielding a sampling based approximation and 2)
Variational Inference (VI) where the posterior over hyperparameters is
approximated by a factorized Gaussian (mean-field) or a full-rank Gaussian
accounting for correlations between hyperparameters. We analyse the
predictive performance for fully Bayesian GPR on a range of benchmark data
sets.

Lauro Langosco di Langosco, Jack Koch, Lee D Sharkey, Jacob Pfau, and David
Krueger.
**Goal misgeneralization in deep
reinforcement learning**.
In *39th International Conference on Machine Learning*, 2022.

Yanzhi Chen, Weihao Sun, Yingzhen Li, and Adrian Weller.
**Scalable infomin
learning**.
In *Advances in Neural Information Processing Systems*, 2022.

** Abstract:** The task of infomin learning aims to learn a
representation with high utility while being uninformative about a specified
target, with the latter achieved by minimising the mutual information between
the representation and the target. It has broad applications, ranging from
training fair prediction models against protected attributes, to unsupervised
learning with disentangled representations. Recent works on infomin learning
mainly use adversarial training, which involves training a neural network to
estimate mutual information or its proxy and thus is slow and difficult to
optimise. Drawing on recent advances in slicing techniques, we propose a new
infomin learning approach, which uses a novel proxy metric to mutual
information. We further derive an accurate and analytically computable
approximation to this proxy metric, thereby removing the need to construct
neural network-based mutual information estimators. Compared to baselines,
experiments on algorithmic fairness, disentangled representation learning and
domain adaptation verify that our method can more effectively remove unwanted
information with a limited time budget.

Andrew Foong, David Burt, Yingzhen Li, and Richard Turner.
**On the expressiveness of approximate inference in Bayesian neural
networks**.
In *Advances in Neural Information Processing Systems 33*, 2020.

Yingzhen Li and Stephan Mandt.
**Disentangled Sequential
Autoencoder**.
In *35th International Conference on Machine Learning*, Stockholm,
Sweden, July 2018.

** Abstract:** We present a VAE
architecture for encoding and generating high dimensional sequential data,
such as video or audio. Our deep generative model learns a latent
representation of the data which is split into a static and dynamic part,
allowing us to approximately disentangle latent time-dependent features
(dynamics) from features which are preserved over time (content). This
architecture gives us partial control over generating content and dynamics by
conditioning on either one of these sets of features. In our experiments on
artificially generated cartoon video clips and voice recordings, we show that
we can convert the content of a given sequence into another one by such
content swapping. For audio, this allows us to convert a male speaker into a
female speaker and vice versa, while for video we can separately manipulate
shapes and dynamics. Furthermore, we give empirical evidence for the
hypothesis that stochastic RNNs as latent state models are more efficient at
compressing and generating long sequences than deterministic ones, which may
be relevant for applications in video compression.

Yingzhen Li and Richard E. Turner.
**Gradient Estimators
for Implicit Models**.
In *Sixth International Conference on Learning Representations*,
Vancouver, Canada, May 2018.

** Abstract:** Implicit models,
which allow for the generation of samples but not for point-wise evaluation
of probabilities, are omnipresent in real-world problems tackled by machine
learning and a hot topic of current research. Some examples include data
simulators that are widely used in engineering and scientific research,
generative adversarial networks (GANs) for image synthesis, and
hot-off-the-press approximate inference techniques relying on implicit
distributions. The majority of existing approaches to learning implicit
models rely on approximating the intractable distribution or optimisation
objective for gradient-based optimisation, which is liable to produce
inaccurate updates and thus poor models. This paper alleviates the need for
such approximations by proposing the Stein gradient estimator, which directly
estimates the score function of the implicitly defined distribution. The
efficacy of the proposed estimator is empirically demonstrated by examples
that include meta-learning for approximate inference and entropy regularised
GANs that provide improved sample diversities.

Cuong V. Nguyen, Yingzhen Li, Thang D. Bui, and Richard E. Turner.
**Variational
Continual Learning**.
In *Sixth International Conference on Learning Representations*,
Vancouver, Canada, May 2018.

** Abstract:** This paper develops
variational continual learning (VCL), a simple but general framework for
continual learning that fuses online variational inference (VI) and recent
advances in Monte Carlo VI for neural networks. The framework can
successfully train both deep discriminative models and deep generative models
in complex continual learning settings where existing tasks evolve over time
and entirely new tasks emerge. Experimental results show that variational
continual learning outperforms state-of-the-art continual learning methods on
a variety of tasks, avoiding catastrophic forgetting in a fully automatic
way.

Yingzhen Li and Yarin Gal.
**Dropout Inference
in Bayesian Neural Networks with Alpha-divergences**.
In *34th International Conference on Machine Learning*, Sydney,
Australia, Aug 2017.

** Abstract:** To obtain uncertainty
estimates with real-world Bayesian deep learning models, practical inference
approximations are needed. Dropout variational inference (VI) for example has
been used for machine vision and medical applications, but VI can severely
underestimate model uncertainty. Alpha-divergences are alternative
divergences to VI’s KL objective, which are able to avoid VI’s
uncertainty underestimation. But these are hard to use in practice: existing
techniques can only use Gaussian approximating distributions, and require
existing models to be changed radically, thus are of limited use for
practitioners. We propose a re-parametrisation of the alpha-divergence
objectives, deriving a simple inference technique which, together with
dropout, can be easily implemented with existing models by simply changing
the loss of the model. We demonstrate improved uncertainty estimates and
accuracy compared to VI in dropout networks. We study our model’s epistemic
uncertainty far away from the data using adversarial images, showing that
these can be distinguished from non-adversarial images by examining our
model’s uncertainty.

Yingzhen Li and Richard E. Turner.
**Rényi
Divergence Variational Inference**.
In *Advances in Neural Information Processing Systems 29*, Barcelona,
Spain, Dec 2016.

** Abstract:** This paper introduces the
variational Rényi bound (VR) that extends traditional variational
inference to Rényi's alpha-divergences. This new family of variational
methods unifies a number of existing approaches, and enables a smooth
interpolation from the evidence lower-bound to the log (marginal) likelihood
that is controlled by the value of alpha that parametrises the divergence.
The reparameterization trick, Monte Carlo approximation and stochastic
optimisation methods are deployed to obtain a tractable and unified framework
for optimisation. We further consider negative alpha values and propose a
novel variational inference method as a new special case in the proposed
framework. Experiments on Bayesian neural networks and variational
auto-encoders demonstrate the wide applicability of the VR bound.

Thang D. Bui, Daniel Hernández-Lobato, José Miguel Hernández-Lobato,
Yingzhen Li, and Richard E. Turner.
**Deep Gaussian
processes for regression using approximate expectation propagation**.
In *33rd International Conference on Machine Learning*, New York, USA,
June 2016.

** Abstract:** Deep Gaussian processes (DGPs) are
multi-layer hierarchical generalisations of Gaussian processes (GPs) and are
formally equivalent to neural networks with multiple, infinitely wide hidden
layers. DGPs are nonparametric probabilistic models and as such are arguably
more flexible, have a greater capacity to generalise, and provide better
calibrated uncertainty estimates than alternative deep models. This paper
develops a new approximate Bayesian learning scheme that enables DGPs to be
applied to a range of medium to large scale regression problems for the first
time. The new method uses an approximate Expectation Propagation procedure
and a novel and efficient extension of the probabilistic backpropagation
algorithm for learning. We evaluate the new method for non-linear regression
on eleven real-world datasets, showing that it always outperforms GP
regression and is almost always better than state-of-the-art deterministic
and sampling-based approximate inference methods for Bayesian neural
networks. As a by-product, this work provides a comprehensive analysis of six
approximate Bayesian methods for training neural networks.

José Miguel Hernández-Lobato, Yingzhen Li, Mark Rowland, Thang D. Bui,
Daniel Hernández-Lobato, and Richard E. Turner.
**Black-box
Alpha Divergence Minimization**.
In *33rd International Conference on Machine Learning*, New York, USA,
June 2016.

** Abstract:** Black-box alpha (BB-$α$) is a
new approximate inference method based on the minimization of
$α$-divergences. BB-$α$ scales to large datasets because it can be
implemented using stochastic gradient descent. BB-$α$ can be applied to
complex probabilistic models with little effort since it only requires as
input the likelihood function and its gradients. These gradients can be
easily obtained using automatic differentiation. By changing the divergence
parameter $α$, the method is able to interpolate between variational Bayes
(VB) ($α \rightarrow 0$) and an algorithm similar to expectation
propagation (EP) ($α = 1$). Experiments on probit regression and neural
network regression and classification problems show that BB-$α$ with
non-standard settings of $α$, such as $α = 0.5$, usually produces
better predictions than with $α \rightarrow 0$ (VB) or $α = 1$
(EP).

Yingzhen Li, José Miguel Hernández-Lobato, and Richard E. Turner.
**Stochastic Expectation
Propagation**.
In *Advances in Neural Information Processing Systems 28*, Montréal,
Canada, Dec 2015.

** Abstract:** Expectation propagation (EP)
is a deterministic approximation algorithm that is often used to perform
approximate Bayesian parameter learning. EP approximates the full intractable
posterior distribution through a set of local-approximations that are
iteratively refined for each datapoint. EP can offer analytic and
computational advantages over other approximations, such as Variational
Inference (VI), and is the method of choice for a number of models. The local
nature of EP appears to make it an ideal candidate for performing Bayesian
learning on large models in large-scale dataset settings. However, EP has a
crucial limitation in this context: the number of approximating factors needs to
increase with the number of data-points, N, which often entails a
prohibitively large memory overhead. This paper presents an extension to EP,
called stochastic expectation propagation (SEP), that maintains a global
posterior approximation (like VI) but updates it in a local way (like EP).
Experiments on a number of canonical learning problems using synthetic and
real-world datasets indicate that SEP performs almost as well as full EP, but
reduces the memory consumption by a factor of N. SEP is therefore ideally
suited to performing approximate Bayesian learning in the large model, large
dataset setting.

Valerii Likhosherstov, Krzysztof Choromanski, Avinava Dubey, Frederick Liu,
Tamas Sarlos, and Adrian Weller.
**Chefs' random tables:
Non-trigonometric random features**, 2022.

**
Abstract:** We introduce chefs' random tables (CRTs), a new class of
non-trigonometric random features (RFs) to approximate Gaussian and softmax
kernels. CRTs are an alternative to standard random kitchen sink (RKS)
methods, which inherently rely on the trigonometric maps. We present variants
of CRTs where RFs are positive, a key requirement for applications in recent
low-rank Transformers. Further variance reduction is possible by leveraging
statistics which are simple to compute. One instantiation of CRTs, the
optimal positive random features (OPRFs), is to our knowledge the first RF
method for unbiased softmax kernel estimation with positive and bounded RFs,
resulting in exponentially small tails and much lower variance than its
counterparts. As we show, orthogonal random features applied in OPRFs provide
additional variance reduction for any dimensionality d (not only
asymptotically for sufficiently large d, as for RKS). We test CRTs on many
tasks ranging from non-parametric classification to training Transformers for
text, speech and image data, obtaining new state-of-the-art results for
low-rank text Transformers, while providing linear space and time
complexity.

Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song,
Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz
Mohiuddin, Lukasz Kaiser, David Benjamin Belanger, Lucy J Colwell, and Adrian
Weller.
**Rethinking attention
with Performers**.
In *International Conference on Learning Representations*, 2021.

** Abstract:** We introduce Performers, Transformer
architectures which can estimate regular (softmax) full-rank-attention
Transformers with provable accuracy, but using only linear (as opposed to
quadratic) space and time complexity, without relying on any priors such as
sparsity or low-rankness. To approximate softmax attention-kernels,
Performers use a novel Fast Attention Via positive Orthogonal Random features
approach (FAVOR+), which may be of independent interest for scalable kernel
methods. FAVOR+ can also be used to efficiently model kernelizable attention
mechanisms beyond softmax. This representational power is crucial to
accurately compare softmax with other kernels for the first time on
large-scale tasks, beyond the reach of regular Transformers, and investigate
optimal attention-kernels. Performers are linear architectures fully
compatible with regular Transformers and with strong theoretical guarantees:
unbiased or nearly-unbiased estimation of the attention matrix, uniform
convergence and low estimation variance. We tested Performers on a rich set
of tasks stretching from pixel-prediction through text models to protein
sequence modeling. We demonstrate competitive results with other examined
efficient sparse and dense attention methods, showcasing effectiveness of the
novel attention-learning paradigm leveraged by Performers.

Valerii Likhosherstov, Anurag Arnab, Krzysztof Choromanski, Mario Lucic,
Yi Tay, Adrian Weller, and Mostafa Dehghani.
**PolyViT: Co-training vision
transformers on images, videos and audio**.
*CoRR*, abs/2111.12993, 2021.

** Abstract:** Can we train
a single transformer model capable of processing multiple modalities and
datasets, whilst sharing almost all of its learnable parameters? We present
PolyViT, a model trained on image, audio and video which answers this
question. By co-training different tasks on a single modality, we are able to
improve the accuracy of each individual task and achieve state-of-the-art
results on 5 standard video- and audio-classification datasets. Co-training
PolyViT on multiple modalities and tasks leads to a model that is even more
parameter-efficient, and learns representations that generalize across
multiple domains. Moreover, we show that co-training is simple and practical
to implement, as we do not need to tune hyperparameters for each combination
of datasets, but can simply adapt those from standard, single-task
training.

Valerii Likhosherstov, Krzysztof M Choromanski, Jared Quincy Davis, Xingyou
Song, and Adrian Weller.
**Sub-linear
memory: How to make performers slim**.
In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan,
editors, *Advances in Neural Information Processing Systems*,
volume 34, pages 6707-6719. Curran Associates, Inc., 2021.

** Abstract:** The Transformer architecture has revolutionized
deep learning on sequential data, becoming ubiquitous in state-of-the-art
solutions for a wide variety of applications. Yet vanilla Transformers are
notoriously resource-expensive, requiring O(L^2) in serial time and memory as
functions of input length L. Recent works proposed various linear
self-attention mechanisms, scaling only as O(L) for serial computation. We
perform a thorough analysis of recent Transformer mechanisms with linear
self-attention, Performers, in terms of overall computational complexity. We
observe a remarkable computational flexibility: forward and backward
propagation can be performed with no approximations using sublinear memory as
a function of L (in addition to negligible storage for the input sequence),
at a cost of greater time complexity in the parallel setting. In the extreme
case, a Performer consumes only O(1) memory during training, and still
requires O(L) time. This discovered time-memory tradeoff can be used for
training or, due to complete backward-compatibility, for fine-tuning on a
low-memory device, e.g. a smartphone or an earlier-generation GPU, thus
contributing towards decentralized and democratized deep learning.

Valerii Likhosherstov, Jared Davis, Krzysztof Choromanski, and Adrian Weller.
**CWY
parametrization: a solution for parallelized optimization of orthogonal and
Stiefel matrices**.
In Arindam Banerjee and Kenji Fukumizu, editors, *Proceedings of The 24th
International Conference on Artificial Intelligence and Statistics*,
volume 130 of *Proceedings of Machine Learning Research*, pages
55-63. PMLR, 13-15 Apr 2021.

** Abstract:** We introduce an
efficient approach for optimization over orthogonal groups on highly parallel
computation units such as GPUs or TPUs. As in earlier work, we parametrize an
orthogonal matrix as a product of Householder reflections. However, to
overcome low parallelization capabilities of computing Householder
reflections sequentially, we propose employing an accumulation scheme called
the compact WY (or CWY) transform – a compact parallelization-friendly
matrix representation for the series of Householder reflections. We further
develop a novel Truncated CWY (or T-CWY) approach for Stiefel manifold
parametrization which has a competitive complexity and, again, yields
benefits when computed on GPUs and TPUs. We prove that our CWY and T-CWY
methods lead to convergence to a stationary point of the training objective
when coupled with stochastic gradient descent. We apply our methods to train
recurrent neural network architectures in the tasks of neural machine
translation and video prediction.
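
The compact WY representation mentioned in the abstract above can be checked numerically. The sketch below is illustrative only, with hypothetical function names: it compares a sequential product of Householder reflections with the closed form Q = I - U S^{-1} U^T, where U holds the normalised reflection vectors as columns and S = I/2 + striu(U^T U).

```python
import numpy as np


def householder_product(vs):
    """Sequential product H(v_1) H(v_2) ... H(v_L) for rows v_i of vs (L, d)."""
    d = vs.shape[1]
    q = np.eye(d)
    for v in vs:
        u = v / np.linalg.norm(v)
        q = q @ (np.eye(d) - 2.0 * np.outer(u, u))
    return q


def cwy(vs):
    """Same matrix via the compact WY form, using only matrix products
    and one small L x L triangular solve."""
    d = vs.shape[1]
    u = (vs / np.linalg.norm(vs, axis=1, keepdims=True)).T   # (d, L)
    # S is upper triangular: 1/2 on the diagonal, u_i . u_j above it.
    s = 0.5 * np.eye(u.shape[1]) + np.triu(u.T @ u, k=1)
    return np.eye(d) - u @ np.linalg.solve(s, u.T)
```

The loop in `householder_product` is inherently sequential, whereas `cwy` reduces to a few dense matrix operations, which is the property that makes the representation attractive on parallel hardware.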

Valerii Likhosherstov, Xingyou Song, Krzysztof Choromanski, Jared Q Davis, and
Adrian Weller.
**Debiasing
a first-order heuristic for approximate bi-level optimization**.
In Marina Meila and Tong Zhang, editors, *Proceedings of the 38th
International Conference on Machine Learning*, volume 139 of
*Proceedings of Machine Learning Research*, pages 6621-6630. PMLR,
18-24 Jul 2021.

** Abstract:** Approximate bi-level
optimization (ABLO) consists of (outer-level) optimization problems,
involving numerical (inner-level) optimization loops. While ABLO has many
applications across deep learning, it suffers from time and memory complexity
proportional to the length $r$ of its inner optimization loop. To address
this complexity, an earlier first-order method (FOM) was proposed as a
heuristic which omits second derivative terms, yielding significant speed
gains and requiring only constant memory. Despite FOM’s popularity, there
is a lack of theoretical understanding of its convergence properties. We
contribute by theoretically characterizing FOM’s gradient bias under mild
assumptions. We further demonstrate a rich family of examples where FOM-based
SGD does not converge to a stationary point of the ABLO objective. We address
this concern by proposing an unbiased FOM (UFOM) enjoying constant memory
complexity as a function of $r$. We characterize the introduced time-variance
tradeoff, demonstrate convergence bounds, and find an optimal UFOM for a
given ABLO problem. Finally, we propose an efficient adaptive UFOM
scheme.

Krzysztof Choromanski, David Cheikhi, Jared Davis, Valerii Likhosherstov,
Achille Nazaret, Achraf Bahamou, Xingyou Song, Mrugank Akarte, Jack
Parker-Holder, Jacob Bergquist, Yuan Gao, Aldo Pacchiano, Tamas Sarlos,
Adrian Weller, and Vikas Sindhwani.
**Stochastic flows and geometric
optimization on the orthogonal group**.
In *37th International Conference on Machine Learning*, 2020.

** Abstract:** We present a new class of stochastic,
geometrically-driven optimization algorithms on the orthogonal group O(d) and
naturally reductive homogeneous manifolds obtained from the action of the
rotation group SO(d). We theoretically and experimentally demonstrate that
our methods can be applied in various fields of machine learning including
deep, convolutional and recurrent neural networks, reinforcement learning,
normalizing flows and metric learning. We show an intriguing connection
between efficient stochastic optimization on the orthogonal group and graph
theory (e.g. matching problem, partition functions over graphs,
graph-coloring). We leverage the theory of Lie groups and provide theoretical
results for the designed class of algorithms. We demonstrate broad
applicability of our methods by showing strong performance on the seemingly
unrelated tasks of learning world models to obtain stable policies for the
most difficult Humanoid agent from OpenAI Gym and improving convolutional
neural networks.

Xinyuan Cao, Weiyang Liu, and Santosh Vempala.
**Provable lifelong learning of representations**.
In *International Conference on Artificial Intelligence and Statistics*,
pages 6334-6356. PMLR, 2022.

Weiyang Liu, Zhen Liu, Liam Paull, Adrian Weller, and Bernhard Schölkopf.
**Structural causal 3D reconstruction**.
In *European Conference on Computer Vision*, pages 140-159. Springer,
2022.

Shengchao Liu, Hanchen Wang, Weiyang Liu, Joan Lasenby, Hongyu Guo, and Jian
Tang.
**Pre-training molecular graph representation with 3D geometry**.
In *International Conference on Learning Representations*, 2022.

Weiyang Liu, Yandong Wen, Bhiksha Raj, Rita Singh, and Adrian Weller.
**SphereFace revived: Unifying hyperspherical face recognition**.
*IEEE Transactions on Pattern Analysis and Machine Intelligence*, 2022.

Yandong Wen, Weiyang Liu, Adrian Weller, Bhiksha Raj, and Rita Singh.
**SphereFace2: Binary classification is all you need for deep face
recognition**.
In *International Conference on Learning Representations*, 2022.

Hanlin Zhang, Yi-Fan Zhang, Weiyang Liu, Adrian Weller, Bernhard Schölkopf,
and Eric P Xing.
**Towards principled disentanglement for domain generalization**.
In *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition*, pages 8024-8034, 2022.

Weiyang Liu, Rongmei Lin, Zhen Liu, James M Rehg, Liam Paull, Li Xiong,
Le Song, and Adrian Weller.
**Orthogonal over-parameterized training**.
In *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition*, pages 7251-7260, 2021.

Weiyang Liu, Rongmei Lin, Zhen Liu, Li Xiong, Bernhard Schölkopf, and
Adrian Weller.
**Learning with hyperspherical uniformity**.
In *International Conference On Artificial Intelligence and Statistics*,
pages 1180-1188. PMLR, 2021.

Weiyang Liu, Zhen Liu, Hanchen Wang, Liam Paull, Bernhard Schölkopf, and
Adrian Weller.
**Iterative teaching by label synthesis**.
*Advances in Neural Information Processing Systems*, 34:21681-21695,
2021.

Yandong Wen, Weiyang Liu, Bhiksha Raj, and Rita Singh.
**Self-supervised 3D face reconstruction via conditional estimation**.
In *Proceedings of the IEEE/CVF International Conference on Computer
Vision*, pages 13289-13298, 2021.

Zhaozhuo Xu, Beidi Chen, Chaojian Li, Weiyang Liu, Le Song, Yingyan Lin, and
Anshumali Shrivastava.
**Locality sensitive teaching**.
*Advances in Neural Information Processing Systems*, 34:18049-18062,
2021.

James Robert Lloyd and Zoubin Ghahramani.
**Statistical
model criticism using kernel two sample tests**.
In *Advances in Neural Information Processing Systems 28*, pages 1-9,
Montreal, Canada, December 2015.

** Abstract:** We propose an
exploratory approach to statistical model criticism using maximum mean
discrepancy (MMD) two sample tests. Typical approaches to model criticism
require a practitioner to select a statistic by which to measure
discrepancies between data and a statistical model. MMD two sample tests are
instead constructed as an analytic maximisation over a large space of
possible statistics and therefore automatically select the statistic which
most shows any discrepancy. We demonstrate on synthetic data that the
selected statistic, called the witness function, can be used to identify
where a statistical model most misrepresents the data it was trained on. We
then apply the procedure to real data where the models being assessed are
restricted Boltzmann machines, deep belief networks and Gaussian process
regression and demonstrate the ways in which these models fail to capture the
properties of the data they are trained on.
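
To make the witness-function idea in the abstract above concrete, here is a minimal NumPy sketch. It assumes an RBF kernel with a fixed lengthscale and uses the biased MMD estimate; the function names are hypothetical and this is not the authors' code.

```python
import numpy as np


def rbf_kernel(a, b, lengthscale=1.0):
    """RBF kernel matrix between rows of a (n, d) and rows of b (m, d)."""
    sq_dists = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dists / lengthscale ** 2)


def mmd_squared(x, y, lengthscale=1.0):
    """Biased estimate of the squared maximum mean discrepancy."""
    return (rbf_kernel(x, x, lengthscale).mean()
            + rbf_kernel(y, y, lengthscale).mean()
            - 2.0 * rbf_kernel(x, y, lengthscale).mean())


def witness(points, x, y, lengthscale=1.0):
    """Unnormalised witness function evaluated at `points`.

    Positive values mark regions where the data x has more mass than
    the model samples y; negative values mark the reverse.
    """
    return (rbf_kernel(points, x, lengthscale).mean(axis=1)
            - rbf_kernel(points, y, lengthscale).mean(axis=1))
```

Evaluating `witness` on a grid and looking for its extremes is one way to locate where a model's samples most disagree with the data.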

Tomoharu Iwata, James Robert Lloyd, and Zoubin Ghahramani.
**Unsupervised
many-to-many object matching for relational data**.
*IEEE Transactions on Pattern Analysis and Machine Intelligence*,
2015.

** Abstract:** We propose a method for unsupervised
many-to-many object matching from multiple networks, which is the task of
finding correspondences between groups of nodes in different networks. For
example, the proposed method can discover shared word groups from
multi-lingual document-word networks without cross-language alignment
information. We assume that multiple networks share groups, and each group
has its own interaction pattern with other groups. Using infinite relational
models with this assumption, objects in different networks are clustered into
common groups depending on their interaction patterns, discovering a
matching. The effectiveness of the proposed method is experimentally
demonstrated by using synthetic and real relational data sets, which include
applications to cross-domain recommendation without shared user/item
identifiers and multi-lingual word clustering.

James Robert Lloyd.
**Representation,
learning, description and criticism of probabilistic models with applications
to networks, functions and relational data**.
PhD thesis, University of Cambridge, Department of Engineering, Cambridge, UK,
2015.

** Abstract:** This thesis makes contributions to a
variety of aspects of probabilistic inference. When performing probabilistic
inference, one must first represent one’s beliefs with a probability
distribution. Specifying the details of a probability distribution can be a
difficult task in many situations, but when expressing beliefs about complex
data structures it may not even be apparent what form such a distribution
should take. This thesis starts by demonstrating how representation theorems
due to Aldous, Hoover and Kallenberg can be used to specify appropriate
models for data in the form of networks. These theorems are then extended in
order to reveal appropriate probability distributions for arbitrary
relational data or databases. A simpler data structure to specify probability
distributions for is that of functions; many probability distributions for
functions have been used for centuries. We demonstrate that many of these
distributions can be expressed in a common language of Gaussian process
kernels constructed from a few base elements and operators. The structure of
this language allows for the effective automatic construction of
probabilistic models for functions. Furthermore, the formal mathematical
language of kernels can be mapped neatly onto natural language allowing for
automatic descriptions of the automatically constructed models. By further
automating the construction of statistical models, the need to be able to
effectively check or criticise these models becomes greater. This thesis
demonstrates how kernel two sample tests can be used to demonstrate where a
probabilistic model most disagrees with data allowing for targeted
improvements to the model. In proposing a new method of model criticism this
thesis also briefly discusses the philosophy of model criticism within the
context of probabilistic inference.

James Robert Lloyd, David Duvenaud, Roger Grosse, Joshua B. Tenenbaum, and
Zoubin Ghahramani.
**Automatic construction and
natural-language description of nonparametric regression models**.
In *Association for the Advancement of Artificial Intelligence (AAAI)*,
July 2014.

** Abstract:** This paper presents the beginnings
of an automatic statistician, focusing on regression problems. Our system
explores an open-ended space of statistical models to discover a good
explanation of a data set, and then produces a detailed report with figures
and natural-language text. Our approach treats unknown regression functions
nonparametrically using Gaussian processes, which has two important
consequences. First, Gaussian processes can model functions in terms of
high-level properties (e.g. smoothness, trends, periodicity, changepoints).
Taken together with the compositional structure of our language of models
this allows us to automatically describe functions in simple terms. Second,
the use of flexible nonparametric models and a rich language for composing
them in an open-ended manner also results in state-of-the-art extrapolation
performance evaluated over 13 real time series data sets from various
domains.

José Miguel Hernández-Lobato, James Robert Lloyd, and Daniel
Hernández-Lobato.
**Gaussian process
conditional copulas with applications to financial time series**.
In *Advances in Neural Information Processing Systems 26*, pages 1-9,
Lake Tahoe, California, USA, December 2013.

** Abstract:** The
estimation of dependencies between multiple variables is a central problem in
the analysis of financial time series. A common approach is to express these
dependencies in terms of a copula function. Typically the copula function is
assumed to be constant but this may be inaccurate when there are covariates
that could have a large influence on the dependence structure of the data. To
account for this, a Bayesian framework for the estimation of conditional
copulas is proposed. In this framework the parameters of a copula are
non-linearly related to some arbitrary conditioning variables. We evaluate
the ability of our method to predict time-varying dependencies on several
equities and currencies and observe consistent performance gains compared to
static copula models and other time-varying copula methods.

David Duvenaud, James Robert Lloyd, Roger Grosse, Joshua B. Tenenbaum, and
Zoubin Ghahramani.
**Structure
discovery in nonparametric regression through compositional kernel
search**.
In *30th International Conference on Machine Learning*, Atlanta,
Georgia, USA, June 2013.

** Abstract:** Despite its
importance, choosing the structural form of the kernel in nonparametric
regression remains a black art. We define a space of kernel structures which
are built compositionally by adding and multiplying a small number of base
kernels. We present a method for searching over this space of structures
which mirrors the scientific discovery process. The learned structures can
often decompose functions into interpretable components and enable long-range
extrapolation on time-series datasets. Our structure search method
outperforms many widely used kernels and kernel combination methods on a
variety of prediction tasks.

James Robert Lloyd.
**GEFCom2012
hierarchical load forecasting: Gradient boosting machines and Gaussian
processes**.
*International Journal of Forecasting*, 2013.

** Abstract:** This report discusses methods for forecasting hourly loads of a
US utility as part of the load forecasting track of the Global Energy
Forecasting Competition 2012 hosted on Kaggle. The methods described
(gradient boosting machines and Gaussian processes) are generic machine
learning / regression algorithms and few domain specific adjustments were
made. Despite this, the algorithms were able to produce highly competitive
predictions and hopefully they can inspire more refined techniques to
compete with state-of-the-art load forecasting methodologies.

James Robert Lloyd, Peter Orbanz, Zoubin Ghahramani, and Daniel M. Roy.
**Random
function priors for exchangeable arrays with applications to graphs and
relational data**.
In *Advances in Neural Information Processing Systems 25*, pages 1-9,
Lake Tahoe, California, USA, December 2012.

** Abstract:** A
fundamental problem in the analysis of structured relational data like
graphs, networks, databases, and matrices is to extract a summary of the
common structure underlying relations between individual entities. Relational
data are typically encoded in the form of arrays; invariance to the ordering
of rows and columns corresponds to exchangeable arrays. Results in
probability theory due to Aldous, Hoover and Kallenberg show that
exchangeable arrays can be represented in terms of a random measurable
function which constitutes the natural model parameter in a Bayesian model.
We obtain a flexible yet simple Bayesian nonparametric model by placing a
Gaussian process prior on the parameter function. Efficient inference
utilises elliptical slice sampling combined with a random sparse
approximation to the Gaussian process. We demonstrate applications of the
model to network data and clarify its relation to models in the literature,
several of which emerge as special cases.

Maria Lomeli, Mark Rowland, Arthur Gretton, and Zoubin Ghahramani.
**Antithetic and Monte Carlo
kernel estimators for partial rankings**.
*arXiv preprint arXiv:1807.00400*, 2018.

** Abstract:**
In the modern age, rankings data is ubiquitous and it is useful for a variety
of applications such as recommender systems, multi-object tracking and
preference learning. However, most rankings data encountered in the real
world is incomplete, which prevents the direct application of existing
modelling tools for complete rankings. Our contribution is a novel way to
extend kernel methods for complete rankings to partial rankings, via
consistent Monte Carlo estimators for Gram matrices: matrices of kernel
values between pairs of observations. We also present a novel variance
reduction scheme based on an antithetic variate construction between
permutations to obtain an improved estimator for the Mallows kernel. The
corresponding antithetic kernel estimator has lower variance and we
demonstrate empirically that it has a better performance in a variety of
Machine Learning tasks. Both kernel estimators are based on extending kernel
mean embeddings to the embedding of a set of full rankings consistent with an
observed partial ranking. They form a computationally tractable alternative
to previous approaches for partial rankings data. An overview of the existing
kernels and metrics for permutations is also provided.

Ben Bloem-Reddy, Emile Mathieu, Adam Foster, Tom Rainforth, Hong Ge, Maria
Lomeli, and Zoubin Ghahramani.
**Sampling and inference
for discrete random probability measures in probabilistic programs**.
In *NIPS workshop on Advances in Approximate Inference*,
California, United States, December 2017.

** Abstract:** We
consider the problem of sampling a sequence from a discrete random
probability measure (RPM) with countable support, under (probabilistic)
constraints of finite memory and computation. A canonical example is sampling
from the Dirichlet Process, which can be accomplished using its well-known
stick-breaking representation and lazy initialization of its atoms. We show
that efficient lazy initialization is possible if and only if a size-biased
representation of the discrete RPM is known. For models constructed from such
discrete RPMs, we consider the implications for generic particle-based
inference methods in probabilistic programming systems. To demonstrate, we
implement posterior inference for Normalized Inverse Gaussian Process mixture
models in Turing.
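
The lazy stick-breaking construction for the Dirichlet Process described in the abstract above can be sketched as follows. This is a minimal illustration under assumptions, not the paper's Turing implementation: the class name, the `base_sampler` callback and the inverse-CDF lookup are all hypothetical choices.

```python
import numpy as np


class LazyDirichletProcess:
    """Draws from one DP realisation, creating atoms only when needed."""

    def __init__(self, alpha, base_sampler, rng):
        self.alpha = alpha
        self.base_sampler = base_sampler  # callable: rng -> atom
        self.rng = rng
        self.weights = []      # stick lengths instantiated so far
        self.atoms = []        # atom locations instantiated so far
        self.remaining = 1.0   # mass of the stick not yet broken

    def sample(self):
        u = self.rng.uniform()
        cumulative = 0.0
        # First look among atoms that already exist.
        for weight, atom in zip(self.weights, self.atoms):
            cumulative += weight
            if u < cumulative:
                return atom
        # Otherwise keep breaking the stick until u is covered.
        while True:
            fraction = self.rng.beta(1.0, self.alpha)
            weight = fraction * self.remaining
            self.remaining -= weight
            self.weights.append(weight)
            self.atoms.append(self.base_sampler(self.rng))
            cumulative += weight
            if u < cumulative:
                return self.atoms[-1]
```

Only finitely many sticks are ever stored, so memory grows with the number of distinct atoms actually visited rather than with the (infinite) support of the measure.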

Maria Lomeli, Stefano Favaro, and Yee Whye Teh.
**A
marginal sampler for sigma-Stable Poisson-Kingman mixture models**.
*Journal of Computational and Graphical Statistics*, 26:44-53, 2017.

** Abstract:** We investigate the class of sigma-stable
Poisson-Kingman random probability measures (RPMs) in the context of Bayesian
nonparametric mixture modeling. This is a large class of discrete RPMs, which
encompasses most of the popular discrete RPMs used in Bayesian
nonparametrics, such as the Dirichlet process, Pitman-Yor process, the
normalized inverse Gaussian process, and the normalized generalized Gamma
process. We show how certain sampling properties and marginal
characterizations of sigma-stable Poisson-Kingman RPMs can be usefully
exploited for devising a Markov chain Monte Carlo (MCMC) algorithm for
performing posterior inference with a Bayesian nonparametric mixture model.
Specifically, we introduce a novel and efficient MCMC sampling scheme in an
augmented space that has a small number of auxiliary variables per iteration.
We apply our sampling scheme to density estimation and clustering tasks
with unidimensional and multidimensional datasets, and compare it against
competing MCMC sampling schemes. Supplementary materials for this article are
available online.

Maria Lomeli.
**General Bayesian inference
schemes in infinite mixture models**.
PhD thesis, University College London, Gatsby Unit, London, UK, 2017.

** Abstract:** Bayesian statistical models allow us to formalise
our knowledge about the world and reason about our uncertainty, but there is
a need for better procedures to accurately encode its complexity. One way to
do so is through compositional models, which are formed by combining blocks
consisting of simpler models. One can increase the complexity of the
compositional model by either stacking more blocks or by using a
not-so-simple model as a building block. This thesis is an example of the
latter. A first aim is to expand the choice of Bayesian nonparametric (BNP)
blocks for constructing tractable compositional models. So far, most of the
models that have a Bayesian nonparametric component use a Dirichlet Process
or a Pitman-Yor process because of the availability of tractable and compact
representations. This thesis shows how to overcome certain intractabilities
in order to obtain analogous compact representations for the class of
Poisson-Kingman priors which includes the Dirichlet and Pitman-Yor processes.
A major impediment to the widespread use of Bayesian nonparametric building
blocks is that inference is often costly, intractable or difficult to carry
out. This is an active research area since dealing with the model's infinite
dimensional component forbids the direct use of standard simulation-based
methods. The main contribution of this thesis is a variety of inference
schemes that tackle this problem: Markov chain Monte Carlo and Sequential
Monte Carlo methods, which are exact inference schemes since they target the
true posterior. The contributions of this thesis, in a larger context,
provide general purpose exact inference schemes in the flavour of
probabilistic programming: the user is able to choose from a variety of
models, focusing only on the modelling part. Indeed, if the wide enough class
of Poisson-Kingman priors is used as one of our blocks, this objective is
achieved.

Maria Lomeli, Stefano Favaro, and Yee Whye Teh.
**A
hybrid sampler for Poisson-Kingman mixture models**.
In *Advances in Neural Information Processing Systems 28*, pages 1-9,
Montreal, Canada, December 2015.

** Abstract:** This paper
concerns the introduction of a new Markov Chain Monte Carlo scheme for
posterior sampling in Bayesian nonparametric mixture models with priors that
belong to the general Poisson-Kingman class. We present a novel and compact
way of representing the infinite dimensional component of the model such that
while explicitly representing this infinite component it has lower memory and
storage requirements than previous MCMC schemes. We describe comparative
simulation results demonstrating the efficacy of the proposed MCMC algorithm
against existing marginal and conditional MCMC samplers.

Stefano Favaro, Maria Lomeli, and Yee Whye Teh.
**On a
class of sigma-Stable Poisson-Kingman models and an effective marginalised
sampler**.
*Statistics and Computing*, 25:67-78, 2015.

** Abstract:** We investigate the use of a large class of discrete random
probability measures, which is referred to as the class Q, in the context
of Bayesian nonparametric mixture modeling. The class Q encompasses both the
two-parameter Poisson-Dirichlet process and the normalized generalized
Gamma process, thus allowing us to comparatively study the inferential
advantages of these two well-known nonparametric priors. Apart from a highly
flexible parameterization, the distinguishing feature of the class Q is the
availability of a tractable posterior distribution. This feature, in turn,
leads to an efficient marginal MCMC algorithm for posterior sampling
within the framework of mixture models. We demonstrate the efficacy of our
modeling framework on both one-dimensional and multi-dimensional
datasets.

Stefano Favaro, Maria Lomeli, Bernardo Nipoti, and Yee Whye Teh.
**Stick-breaking
representations of sigma-Stable Poisson-Kingman models**.
*Electronic Journal of Statistics*, 8:1063-1085, 2014.

** Abstract:** In this paper we investigate the stick-breaking representation
for the class of sigma-Stable Poisson-Kingman models, also known as
Gibbs-type random probability measures. This class includes as special cases
most of the discrete priors commonly used in Bayesian nonparametrics, such as
the two parameter Poisson-Dirichlet process and the normalized generalized
Gamma process. Under the assumption sigma=u/v, for any coprime integers
1 ≤ u < v such that u/v < 1/2, we show that a sigma-stable Poisson-Kingman model
admits an explicit stick-breaking representation in terms of random variables
which are obtained by suitably transforming Gamma random variables and
products of independent Beta and Gamma random variables.

Dino Sejdinovic, Heiko Strathmann, Maria Lomeli, Christophe Andrieu, and Arthur
Gretton.
**Kernel adaptive
Metropolis-Hastings**.
In *31st International Conference on Machine Learning*, pages 1-9,
Beijing, China, June 2014.

** Abstract:** A Kernel Adaptive
Metropolis-Hastings algorithm is introduced, for the purpose of sampling
from a target distribution with strongly nonlinear support. The algorithm
embeds the trajectory of the Markov chain into a reproducing kernel
Hilbert space (RKHS), such that the feature space covariance of the samples
informs the choice of proposal. The procedure is computationally efficient
and straightforward to implement, since the RKHS moves can be integrated
out analytically: our proposal distribution in the original space is a
normal distribution whose mean and covariance depend on where the current
sample lies in the support of the target distribution, and adapts to its
local covariance structure. Furthermore, the procedure requires neither
gradients nor any other higher order information about the target, making
it particularly attractive for contexts such as Pseudo-Marginal MCMC.
Kernel Adaptive Metropolis-Hastings outperforms competing fixed and adaptive
samplers on multivariate, highly nonlinear target distributions, arising
in both real-world and synthetic examples.

David Lopez-Paz, Suvrit Sra, Alex J. Smola, Zoubin Ghahramani, and Bernhard
Schölkopf.
**Randomized
nonlinear component analysis**.
In *ICML*, volume 29 of *JMLR Proceedings*. JMLR.org,
2014.

** Abstract:** Classical techniques such as Principal
Component Analysis (PCA) and Canonical Correlation Analysis (CCA) are
ubiquitous in statistics. However, these techniques only reveal linear
relationships in data. Although nonlinear variants of PCA and CCA have been
proposed, they are computationally prohibitive in the large scale. In a
separate strand of recent research, randomized methods have been proposed to
construct features that help reveal nonlinear patterns in data. For basic
tasks such as regression or classification, random features exhibit little or
no loss in performance, while achieving dramatic savings in computational
requirements. In this paper we leverage randomness to design scalable new
variants of nonlinear PCA and CCA; our ideas also extend to key multivariate
analysis tools such as spectral clustering or LDA. We demonstrate our
algorithms through experiments on real-world data, on which we compare
against the state-of-the-art. Code in R implementing our methods is provided
in the Appendix.

David Lopez-Paz, Philipp Hennig, and Bernhard Schölkopf.
**The randomized dependence
coefficient**.
In *Advances in Neural Information Processing Systems 26*, pages 1-9,
Lake Tahoe, California, USA, December 2013.

** Abstract:** We
introduce the Randomized Dependence Coefficient (RDC), a measure of
non-linear dependence between random variables of arbitrary dimension based
on the Hirschfeld-Gebelein-Rényi Maximum Correlation Coefficient. RDC
is defined in terms of correlation of random non-linear copula projections;
it is invariant with respect to marginal distribution transformations, has
low computational cost and is easy to implement: just five lines of R code,
included at the end of the paper.

David Lopez-Paz, José Miguel Hernández-Lobato, and Zoubin Ghahramani.
**Gaussian
process vine copulas for multivariate dependence**.
In *30th International Conference on Machine Learning*, Atlanta,
Georgia, USA, June 2013.

** Abstract:** Copulas allow one to learn
marginal distributions separately from the multivariate dependence structure
(copula) that links them together into a density function. Vine
factorizations ease the learning of high-dimensional copulas by constructing
a hierarchy of conditional bivariate copulas. However, to simplify inference,
it is common to assume that each of these conditional bivariate copulas is
independent from its conditioning variables. In this paper, we relax this
assumption by discovering the latent functions that specify the shape of a
conditional copula given its conditioning variables. We learn these functions
by following a Bayesian approach based on sparse Gaussian processes with
expectation propagation for scalable, approximate inference. Experiments on
real-world datasets show that, when modeling all conditional dependencies, we
obtain better estimates of the underlying copula of the data.

David Lopez-Paz, José Miguel Hernández-Lobato, and Bernhard Schölkopf.
**Semi-supervised
domain adaptation with non-parametric copulas**.
In *Advances in Neural Information Processing Systems 25*, pages 1-9,
Lake Tahoe, California, USA, December 2012.

** Abstract:** A
new framework based on the theory of copulas is proposed to address
semisupervised domain adaptation problems. The presented method factorizes
any multivariate density into a product of marginal distributions and
bivariate copula functions. Therefore, changes in each of these factors can
be detected and corrected to adapt a density model across different learning
domains. Importantly, we introduce a novel vine copula model, which allows
for this factorization in a non-parametric manner. Experimental results on
regression problems with real-world data illustrate the efficacy of the
proposed approach when compared to state-of-the-art techniques.

Matthew Ashman, Thang D. Bui, Cuong V. Nguyen, Efstratios Markou, Adrian
Weller, Siddharth Swaroop, and Richard E. Turner.
**Partitioned variational inference:
A framework for probabilistic federated learning**.
2022.

** Abstract:** The proliferation of computing devices has
brought about an opportunity to deploy machine learning models on new problem
domains using previously inaccessible data. Traditional algorithms for
training such models often require data to be stored on a single machine with
compute performed by a single node, making them unsuitable for decentralised
training on multiple devices. This deficiency has motivated the development
of federated learning algorithms, which allow multiple data owners to train
collaboratively and use a shared model whilst keeping local data private.
However, many of these algorithms focus on obtaining point estimates of model
parameters, rather than probabilistic estimates capable of capturing model
uncertainty, which is essential in many applications. Variational inference
(VI) has become the method of choice for fitting many modern probabilistic
models. In this paper we introduce partitioned variational inference (PVI), a
general framework for performing VI in the federated setting. We develop new
supporting theory for PVI, demonstrating a number of properties that make it
an attractive choice for practitioners; use PVI to unify a wealth of
fragmented, yet related literature; and provide empirical results that
showcase the effectiveness of PVI in a variety of federated settings.

Gergely Flamich, Stratis Markou, and José Miguel Hernández-Lobato.
**Fast relative entropy coding with
A* coding**.
In *39th International Conference on Machine Learning*, 2022.

** Abstract:** Relative entropy coding (REC) algorithms encode a
sample from a target distribution $Q$ using a proposal distribution $P$, such
that the expected codelength is $\mathcal{O}(D_{KL}[Q||P])$. REC can be
seamlessly integrated with existing learned compression models since, unlike
entropy coding, it does not assume discrete $Q$ or $P$, and does not require
quantisation. However, general REC algorithms require an intractable
$\Omega(e^{D_{KL}[Q||P]})$ runtime. We introduce AS* and AD* coding, two REC
algorithms based on A* sampling. We prove that, for continuous distributions
over $\mathbb{R}$, if the density ratio is unimodal, AS* has
$\mathcal{O}(D_{\infty}[Q||P])$ expected runtime, where
$D_{\infty}[Q||P]$ is the Rényi $\infty$-divergence. We provide
experimental evidence that AD* also has $\mathcal{O}(D_{\infty}[Q||P])$
expected runtime. We prove that AS* and AD* achieve an expected codelength of
$\mathcal{O}(D_{KL}[Q||P])$. Further, we introduce DAD*, an approximate
algorithm based on AD* which retains its favourable runtime and has bias
similar to that of alternative methods. Focusing on VAEs, we propose the
IsoKL VAE (IKVAE), which can be used with DAD* to further improve compression
efficiency. We evaluate A* coding with (IK)VAEs on MNIST, showing that it can
losslessly compress images near the theoretically optimal limit.

Stratis Markou, James Requeima, Wessel P. Bruinsma, Anna Vaughan, and
Richard E. Turner.
**Practical conditional
neural processes via tractable dependent predictions**.
In *10th International Conference on Learning Representations*, 2022.

** Abstract:** Conditional Neural Processes (CNPs; Garnelo et
al., 2018) are meta-learning models which leverage the flexibility of deep
learning to produce well-calibrated predictions and naturally handle
off-the-grid and missing data. CNPs scale to large datasets and train with
ease. Due to these features, CNPs appear well-suited to tasks from
environmental sciences or healthcare. Unfortunately, CNPs do not produce
correlated predictions, making them fundamentally inappropriate for many
estimation and decision making tasks. Predicting heat waves or floods, for
example, requires modelling dependencies in temperature or precipitation over
time and space. Existing approaches which model output dependencies, such as
Neural Processes (NPs; Garnelo et al., 2018b) or the FullConvGNP (Bruinsma et
al., 2021), are either complicated to train or prohibitively expensive. What
is needed is an approach which provides dependent predictions, but is simple
to train and computationally tractable. In this work, we present a new class
of Neural Process models that make correlated predictions and support exact
maximum likelihood training that is simple and scalable. We extend the
proposed models by using invertible output transformations, to capture
non-Gaussian output distributions. Our models can be used in downstream
estimation tasks which require dependent function samples. By accounting for
output dependencies, our models show improved predictive performance on a
range of experiments with synthetic and real data.

Angus Phillips, Thomas Seror, Michael Hutchinson, Valentin De Bortoli, Arnaud
Doucet, and Emile Mathieu.
**Spectral diffusion
processes**.
In *NeurIPS workshop on Score-Based Methods*, 2022.

** Abstract:** Score-based generative modelling (SGM) has proven
to be a very effective method for modelling densities on finite-dimensional
spaces. In this work we propose to extend this methodology to learn
generative models over functional spaces. To do so, we represent functional
data in spectral space to dissociate the stochastic part of the processes
from their space-time part. Using dimensionality reduction techniques we then
sample from their stochastic component using finite dimensional SGM. We
demonstrate our method's effectiveness for modelling various multimodal
datasets.

**Sampling and inference
for discrete random probability measures in probabilistic programs**.
In *NIPS workshop on Advances in Approximate Inference*,
California, United States, December 2017.

** Abstract:** We
consider the problem of sampling a sequence from a discrete random
probability measure (RPM) with countable support, under (probabilistic)
constraints of finite memory and computation. A canonical example is sampling
from the Dirichlet Process, which can be accomplished using its well-known
stick-breaking representation and lazy initialization of its atoms. We show
that efficient lazy initialization is possible if and only if a size-biased
representation of the discrete RPM is known. For models constructed from such
discrete RPMs, we consider the implications for generic particle-based
inference methods in probabilistic programming systems. To demonstrate, we
implement posterior inference for Normalized Inverse Gaussian Process mixture
models in Turing.
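
The canonical example in the abstract, drawing from a Dirichlet Process through stick-breaking with lazily instantiated atoms, can be sketched as below. This is a minimal illustration using only the Python standard library, not the Turing implementation from the paper; the class name, the `base_sampler` callback and the numerical guard are assumptions of this sketch.

```python
import random

class LazyDirichletProcess:
    """Draws from one realisation of a DP, creating sticks only when needed."""

    def __init__(self, alpha, base_sampler, seed=0):
        self.alpha = alpha                  # concentration parameter
        self.base_sampler = base_sampler    # hypothetical callback: rng -> atom
        self.rng = random.Random(seed)
        self.weights = []                   # stick weights instantiated so far
        self.atoms = []                     # atom attached to each stick
        self.remaining = 1.0                # mass of the unbroken stick

    def sample(self):
        u = self.rng.random()
        cum = 0.0
        # First search the sticks that already exist.
        for w, atom in zip(self.weights, self.atoms):
            cum += w
            if u < cum:
                return atom
        # Otherwise break new sticks until u is covered.
        while True:
            beta = self.rng.betavariate(1.0, self.alpha)
            w = beta * self.remaining
            self.remaining -= w
            self.weights.append(w)
            self.atoms.append(self.base_sampler(self.rng))
            cum += w
            # The second condition is a guard against floating-point round-off.
            if u < cum or self.remaining < 1e-12:
                return self.atoms[-1]

dp = LazyDirichletProcess(alpha=1.0, base_sampler=lambda rng: rng.gauss(0.0, 1.0))
draws = [dp.sample() for _ in range(200)]
```

Only finitely many sticks are ever stored, and because sticks are created in the order in which mass is first needed, the stored weights follow the size-biased ordering that the abstract identifies as the key requirement.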

Jiri Hron, Alexander G. D. G. Matthews, and Zoubin Ghahramani.
**Variational Bayesian dropout:
pitfalls and fixes**.
*ICML*, 2018.

** Abstract:** Dropout, a stochastic
regularisation technique for training of neural networks, has recently been
reinterpreted as a specific type of approximate inference algorithm for
Bayesian neural networks. The main contribution of the reinterpretation is in
providing a theoretical framework useful for analysing and extending the
algorithm. We show that the proposed framework suffers from several issues;
from undefined or pathological behaviour of the true posterior related to use
of improper priors, to an ill-defined variational objective due to
singularity of the approximating distribution relative to the true posterior.
Our analysis of the improper log uniform prior used in variational Gaussian
dropout suggests the pathologies are generally irredeemable, and that the
algorithm still works only because the variational formulation annuls some of
the pathologies. To address the singularity issue, we proffer Quasi-KL (QKL)
divergence, a new approximate inference objective for approximation of
high-dimensional distributions. We show that motivations for variational
Bernoulli dropout based on discretisation and noise have QKL as a limit.
Properties of QKL are studied both theoretically and on a simple practical
example which shows that the QKL-optimal approximation of a full rank
Gaussian with a degenerate one naturally leads to the Principal Component
Analysis solution.

Alexander G. D. G. Matthews, Jiri Hron, Mark Rowland, Richard E. Turner, and
Zoubin Ghahramani.
**Gaussian process behaviour in
wide deep neural networks**.
*ICLR*, 2018.

** Abstract:** Whilst deep neural networks
have shown great empirical success, there is still much work to be done to
understand their theoretical properties. In this paper, we study the
relationship between random, wide, fully connected, feedforward networks with
more than one hidden layer and Gaussian processes with a recursive kernel
definition. We show that, under broad conditions, as we make the architecture
increasingly wide, the implied random function converges in distribution to a
Gaussian process, formalising and extending existing results by Neal (1996)
to deep networks. To evaluate convergence rates empirically, we use maximum
mean discrepancy. We then compare finite Bayesian deep networks from the
literature to Gaussian processes in terms of the key predictive quantities of
interest, finding that in some cases the agreement can be very close. We
discuss the desirability of Gaussian process behaviour and review
non-Gaussian alternative models from the literature.

Alexander G. D. G. Matthews, Jiri Hron, Richard E. Turner, and Zoubin
Ghahramani.
**Sample-then-optimise
posterior sampling for Bayesian linear models**.
*AABI (NeurIPS workshop)*, 2017.

** Abstract:** In modern
machine learning it is common to train models which have an extremely high
intrinsic capacity. The results obtained are often initialization dependent,
are different for disparate optimizers and in some cases have no explicit
regularization. This raises difficult questions about generalization. A
natural approach to questions of generalization is a Bayesian one. There is
therefore a growing literature attempting to understand how Bayesian
posterior inference could emerge from the complexity of modern practice, even
without having such a procedure as the stated goal. In this work we consider
a simple special case where exact Bayesian posterior sampling emerges from
sampling (cf initialization) and then gradient descent. Specifically, for a
Bayesian linear model, if we parameterize it in terms of a deterministic
function of an isotropic normal prior, then the action of sampling from the
prior followed by first order optimization of the squared loss will give a
posterior sample. Although the assumptions are stronger than many real
problems, it still exhibits the challenging properties of redundant model
capacity and a lack of explicit regularizers, along with initialization and
optimizer dependence. It is therefore an interesting controlled test case.
Given its simplicity, the method itself may turn out to be of independent
interest from our original goal.
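
A small numerical sketch of the idea, under assumptions that are not taken from the paper: a noise-free, over-parameterised linear model with a standard normal prior on the weights. In that setting, plain gradient descent on the squared loss, started from a prior sample, converges to the projection of that sample onto the set of interpolating solutions, which is distributed as the exact posterior. The function name, step size and problem sizes are illustrative choices.

```python
import numpy as np

def sample_then_optimise(Phi, y, theta0, steps=2000):
    """Gradient descent on 0.5 * ||Phi @ theta - y||^2 starting from theta0."""
    # Step size 1/L, where L is the largest eigenvalue of Phi^T Phi.
    lr = 1.0 / np.linalg.norm(Phi, 2) ** 2
    theta = theta0.copy()
    for _ in range(steps):
        theta -= lr * Phi.T @ (Phi @ theta - y)
    return theta

rng = np.random.default_rng(0)
n, d = 5, 20                                  # fewer data points than weights
Phi = rng.standard_normal((n, d))             # design matrix
y = rng.standard_normal(n)                    # noise-free targets (assumption)

theta0 = rng.standard_normal(d)               # sample from the N(0, I) prior
theta = sample_then_optimise(Phi, y, theta0)  # optimise without a regulariser

# Closed form for comparison: the projection of theta0 onto {theta : Phi theta = y}.
closed_form = theta0 + np.linalg.pinv(Phi) @ (y - Phi @ theta0)
```

Here `theta` and `closed_form` should agree up to optimisation error. Over repeated prior draws, `closed_form` has mean `pinv(Phi) @ y` and covariance `I - pinv(Phi) @ Phi`, which is the Gaussian posterior for this noise-free model.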

Alexander G D G Matthews, James Hensman, Richard E. Turner, and Zoubin
Ghahramani.
**On Sparse Variational methods
and the Kullback-Leibler divergence between stochastic processes**.
In *19th International Conference on Artificial Intelligence and
Statistics*, Cadiz, Spain, May 2016.

** Abstract:** The
variational framework for learning inducing variables (Titsias, 2009a) has
had a large impact on the Gaussian process literature. The framework may be
interpreted as minimizing a rigorously defined Kullback-Leibler divergence
between the approximating and posterior processes. To our knowledge this
connection has thus far gone unremarked in the literature. In this paper we
give a substantial generalization of the literature on this topic. We give a
new proof of the result for infinite index sets which allows inducing points
that are not data points and likelihoods that depend on all function values.
We then discuss augmented index sets and show that, contrary to previous
works, marginal consistency of augmentation is not enough to guarantee
consistency of variational inference with the original model. We then
characterize an extra condition where such a guarantee is obtainable. Finally
we show how our framework sheds light on interdomain sparse approximations
and sparse approximations for Cox processes.

James Hensman, Alexander G D G Matthews, Maurizio Filippone, and Zoubin
Ghahramani.
**MCMC
for Variationally Sparse Gaussian Processes**.
In *Advances in Neural Information Processing Systems 28*, pages 1-9,
Montreal, Canada, December 2015.

** Abstract:** Gaussian
process (GP) models form a core part of probabilistic machine learning.
Considerable research effort has been made into attacking three issues with
GP models: how to compute efficiently when the number of data is large; how
to approximate the posterior when the likelihood is not Gaussian and how to
estimate covariance function parameter posteriors. This paper simultaneously
addresses these, using a variational approximation to the posterior which is
sparse in support of the function but otherwise free-form. The result is a
Hybrid Monte-Carlo sampling scheme which allows for a non-Gaussian
approximation over the function values and covariance parameters
simultaneously, with efficient computations based on inducing-point sparse
GPs. Code to replicate each experiment in this paper will be available
shortly.

James Hensman, Alexander G D G Matthews, and Zoubin Ghahramani.
**Scalable
Variational Gaussian Process Classification**.
In *18th International Conference on Artificial Intelligence and
Statistics*, pages 1-9, San Diego, California, USA, May 2015.

** Abstract:** Gaussian process classification is a popular
method with a number of appealing properties. We show how to scale the model
within a variational inducing point framework, out-performing the state of
the art on benchmark datasets. Importantly, the variational formulation can be
exploited to allow classification in problems with millions of data points,
as we demonstrate in experiments.

Alexander G. D. G Matthews, James Hensman, and Zoubin Ghahramani.
**Comparing
lower bounds on the entropy of mixture distributions for use in variational
inference**.
In *NIPS workshop on Advances in Variational Inference*,
Montreal, Canada, December 2014.

Alexander G. D. G. Matthews and Zoubin Ghahramani.
**Classification using log
Gaussian Cox processes**.
*arXiv preprint arXiv:1405.4141*, 2014.

** Abstract:**
McCullagh and Yang (2006) suggest a family of classification algorithms
based on Cox processes. We further investigate the log Gaussian variant which
has a number of appealing properties. Conditioned on the covariates, the
distribution over labels is given by a type of conditional Markov random
field. In the supervised case, computation of the predictive probability of a
single test point scales linearly with the number of training points and the
multiclass generalization is straightforward. We show new links between the
supervised method and classical nonparametric methods. We give a detailed
analysis of the pairwise graph representable Markov random field, which we
use to extend the model to semi-supervised learning problems, and propose an
inference method based on graph min-cuts. We give the first experimental
analysis on supervised and semi-supervised datasets and show good empirical
performance.

Rowan McAllister and Carl Edward Rasmussen.
**Data-efficient
reinforcement learning in continuous state-action
Gaussian-POMDPs**.
In *Advances in Neural Information Processing Systems 31*, Long Beach,
California, December 2017.

** Abstract:** We present a
data-efficient reinforcement learning method for continuous state-action
systems under significant observation noise. Data-efficient solutions under
small noise exist, such as PILCO which learns the cartpole swing-up task in
30s. PILCO evaluates policies by planning state-trajectories using a dynamics
model. However, PILCO applies policies to the observed state, therefore
planning in observation space. We extend PILCO with filtering to instead plan
in belief space, consistent with partially observable Markov decision
process (POMDP) planning. This enables data-efficient learning under
significant observation noise, outperforming more naive methods such as
post-hoc application of a filter to policies optimised by the original
(unfiltered) PILCO algorithm. We test our method on the cartpole swing-up
task, which involves nonlinear dynamics and requires nonlinear control.

Rowan McAllister, Yarin Gal, Alex Kendall, Mark van der Wilk, Amar Shah,
Roberto Cipolla, and Adrian Weller.
**Concrete problems
for autonomous vehicle safety: Advantages of Bayesian deep
learning**.
In *International Joint Conference on Artificial Intelligence*,
Melbourne, Australia, August 2017.

** Abstract:** Autonomous
vehicle (AV) software is typically composed of a pipeline of individual
components, linking sensor inputs to motor outputs. Erroneous component
outputs propagate downstream, hence safe AV software must consider the
ultimate effect of each component's errors. Further, improving safety alone
is not sufficient. Passengers must also feel safe to trust and use AV
systems. To address such concerns, we investigate three under-explored themes
for AV research: safety, interpretability, and compliance. Safety can be
improved by quantifying the uncertainties of component outputs and
propagating them forward through the pipeline. Interpretability is concerned
with explaining what the AV observes and why it makes the decisions it does,
building reassurance with the passenger. Compliance refers to maintaining
some control for the passenger. We discuss open challenges for research
within these themes. We highlight the need for concrete evaluation metrics,
propose example problems, and highlight possible solutions.

Rowan McAllister.
**Bayesian learning for
data-efficient control**.
PhD thesis, University of Cambridge, Department of Engineering, Cambridge, UK,
2016.

** Abstract:** Applications to learn control of
unfamiliar dynamical systems with increasing autonomy are ubiquitous. From
robotics, to finance, to industrial processing, autonomous learning helps
obviate a heavy reliance on experts for system identification and controller
design. Often real world systems are nonlinear, stochastic, and expensive to
operate (e.g. slow, energy intensive, prone to wear and tear). Ideally
therefore, nonlinear systems can be identified with minimal system
interaction. This thesis considers data efficient autonomous learning of
control of nonlinear, stochastic systems. Data efficient learning critically
requires probabilistic modelling of dynamics. Traditional control approaches
use deterministic models, which easily overfit data, especially small
datasets. We use probabilistic Bayesian modelling to learn systems from
scratch, similar to the PILCO algorithm, which achieved unprecedented data
efficiency in learning control of several benchmarks. We extend PILCO in
three principal ways. First, we learn control under significant observation
noise by simulating a filtered control process using a tractably analytic
framework of Gaussian distributions. In addition, we develop the "latent
variable belief Markov decision process" when filters must predict under
real-time constraints. Second, we improve PILCO's data efficiency by
directing exploration with predictive loss uncertainty and Bayesian
optimisation, including a novel approximation to the Gittins index. Third, we
take a step towards data efficient learning of high-dimensional control using
Bayesian neural networks (BNN). Experimentally we show although filtering
mitigates adverse effects of observation noise, much greater performance is
achieved when optimising controllers with evaluations faithful to reality: by
simulating closed-loop filtered control if executing closed-loop filtered
control. Thus, controllers are optimised w.r.t. how they are used,
outperforming filters applied to systems optimised by unfiltered simulations.
We show directed exploration improves data efficiency. Lastly, we show BNN
dynamics models are almost as data efficient as Gaussian process models.
Results show data efficient learning of high-dimensional control is possible
as BNNs scale to high-dimensional state inputs.

B. Bischoff, D. Nguyen-Tuong, D. van Hoof, A. McHutchon, Carl Edward Rasmussen,
A. Knoll, and M. P. Deisenroth.
**Policy search
for learning robot control using sparse data**.
In *IEEE International Conference on Robotics and Automation*, pages
3882-3887, Hong Kong, China, 2014. IEEE, doi
10.1109/ICRA.2014.6907422.

** Abstract:** In many complex
robot applications, such as grasping and manipulation, it is difficult to
program desired task solutions beforehand, as robots are within an uncertain
and dynamic environment. In such cases, learning tasks from experience can be
a useful alternative. To obtain a sound learning and generalization
performance, machine learning, especially, reinforcement learning, usually
requires sufficient data. However, in cases where only little data is
available for learning, due to system constraints and practical issues,
reinforcement learning can act suboptimally. In this paper, we investigate
how model-based reinforcement learning, in particular the probabilistic
inference for learning control method (PILCO), can be tailored to cope with
the case of sparse data to speed up learning. The basic idea is to include
further prior knowledge into the learning process. As PILCO is built on the
probabilistic Gaussian processes framework, additional system knowledge can
be incorporated by defining appropriate prior distributions, e.g. a linear
mean Gaussian prior. The resulting PILCO formulation remains in closed form
and analytically tractable. The proposed approach is evaluated in simulation
as well as on a physical robot, the Festo Robotino XT. For the robot
evaluation, we employ the approach for learning an object pick-up task. The
results show that by including prior knowledge, policy learning can be sped
up in presence of sparse data.

Andrew McHutchon.
**Nonlinear modelling and
control using Gaussian processes**.
PhD thesis, University of Cambridge, Department of Engineering, Cambridge, UK,
2014.

** Abstract:** In many scientific disciplines it is
often required to make predictions about how a system will behave or to
deduce the correct control values to elicit a particular desired response.
Efficiently solving both of these tasks relies on the construction of a model
capturing the system's operation. In the most interesting situations, the
model needs to capture strongly nonlinear effects and deal with the presence
of uncertainty and noise. Building models for such systems purely based on a
theoretical understanding of underlying physical principles can be infeasibly
complex and require a large number of simplifying assumptions. An alternative
is to use a data-driven approach, which builds a model directly from
observations. A powerful and principled approach to doing this is to use a
Gaussian Process (GP).

In this thesis we start by discussing how GPs can
be applied to data sets which have noise affecting their inputs. We present
the "Noisy Input GP", which uses a simple local-linearisation to refer the
input noise into heteroscedastic output noise, and compare it to other
methods both theoretically and empirically. We show that this technique leads
to an effective model for nonlinear functions with input and output noise. We
then consider the broad topic of GP state space models for application to
dynamical systems. We discuss a very wide variety of approaches for using GPs
in state space models, including introducing a new method based on
moment-matching, which consistently gave the best performance. We analyse the
methods in some detail including providing a systematic comparison between
approximate-analytic and particle methods. To our knowledge such a comparison
has not been provided before in this area. Finally, we investigate an
automatic control learning framework, which uses Gaussian Processes to model
a system for which we wish to design a controller. Controller design for
complex systems is a difficult task and thus a framework which allows an
automatic design directly from data promises to be extremely useful. We
demonstrate that the previously published framework cannot cope with the
presence of observation noise but that the introduction of a state space
model dramatically improves its performance. This contribution, along with
some other suggested improvements, opens the door for this framework to be
used in real-world applications.

Andrew McHutchon and Carl Edward Rasmussen.
**Gaussian process
training with input noise**.
In J. Shawe-Taylor, R.S. Zemel, P.L. Bartlett, F. Pereira, and K.Q. Weinberger,
editors, *Advances in Neural Information Processing Systems 24*, pages
1341-1349, Granada, Spain, 2011. Curran Associates, Inc.

** Abstract:** In standard Gaussian Process regression input locations are
assumed to be noise free. We present a simple yet effective GP model for
training on input points corrupted by i.i.d. Gaussian noise. To make
computations tractable we use a local linear expansion about each input
point. This allows the input noise to be recast as output noise proportional
to the squared gradient of the GP posterior mean. The input noise
hyperparameters are trained alongside other hyperparameters by the usual
method of maximisation of the marginal likelihood, and allow estimation of
the noise levels on each input dimension. Training uses an iterative scheme,
which alternates between optimising the hyperparameters and calculating the
posterior gradient. Analytic predictive moments can then be found for
Gaussian distributed test points. We compare our model to others over a range
of different regression problems and show that it improves over current
methods.
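
The local linear expansion described in the abstract can be written out as a short worked step. The notation is an assumption of this note rather than the paper's: $x$ is the observed input, $\epsilon_x \sim \mathcal{N}(0, \Sigma_x)$ is the input noise with diagonal $\Sigma_x$, $\epsilon_y \sim \mathcal{N}(0, \sigma_y^2)$ is the output noise, and $\partial_{\bar f}(x)$ is the gradient of the GP posterior mean at $x$.

$$
y = f(x - \epsilon_x) + \epsilon_y
\;\approx\; f(x) - \epsilon_x^\top \partial_{\bar f}(x) + \epsilon_y ,
\qquad
\mathrm{Var}\left[\, y \mid f(x) \,\right]
\;\approx\; \sigma_y^2 + \partial_{\bar f}(x)^\top \Sigma_x \, \partial_{\bar f}(x) .
$$

Under this approximation the input noise acts as an additional, input-dependent output noise that is large where the posterior mean is steep and vanishes where it is flat.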

Shakir Mohamed, Katherine A. Heller, and Zoubin Ghahramani.
**Bayesian and
L1 approaches for sparse unsupervised learning**.
In

** Abstract:** The use of L1 regularisation for sparse learning
has generated immense research interest, with many successful applications in
diverse areas such as signal acquisition, image coding, genomics and
collaborative filtering. While existing work highlights the many advantages
of L1 methods, in this paper we find that L1 regularisation often
dramatically under-performs in terms of predictive performance when compared
to other methods for inferring sparsity. We focus on unsupervised latent
variable models, and develop L1 minimising factor models, Bayesian variants
of "L1", and Bayesian models with a stronger L0-like sparsity induced through
spike-and-slab distributions. These spike-and-slab Bayesian factor models
encourage sparsity while accounting for uncertainty in a principled manner,
and avoid unnecessary shrinkage of non-zero values. We demonstrate on a
number of data sets that in practice spike-and-slab Bayesian methods
out-perform L1 minimisation, even on a computational budget. We thus
highlight the need to re-assess the wide use of L1 methods in
sparsity-reliant applications, particularly when we care about generalising
to previously unseen data, and provide an alternative that, over many varying
conditions, provides improved generalisation performance.

Shakir Mohamed.
**Generalised Bayesian
matrix factorisation models**.
PhD thesis, University of Cambridge, Department of Engineering, Cambridge, UK,
2011.

** Abstract:** Factor analysis and related models for
probabilistic matrix factorisation are of central importance to the
unsupervised analysis of data, with a colourful history more than a century
long. Probabilistic models for matrix factorisation allow us to explore the
underlying structure in data, and have relevance in a vast number of
application areas including collaborative filtering, source separation,
missing data imputation, gene expression analysis, information retrieval,
computational finance and computer vision, amongst others.

This thesis
develops generalisations of matrix factorisation models that advance our
understanding and enhance the applicability of this important class of
models. The generalisation of models for matrix factorisation focuses on
three concerns: widening the applicability of latent variable models to the
diverse types of data that are currently available; considering alternative
structural forms in the underlying representations that are inferred; and
including higher order data structures into the matrix factorisation
framework. These three issues reflect the reality of modern data analysis and
we develop new models that allow for a principled exploration and use of data
in these settings. We place emphasis on Bayesian approaches to learning and
the advantages that come with the Bayesian methodology. Our port of departure
is a generalisation of latent variable models to members of the exponential
family of distributions. This generalisation allows for the analysis of data
that may be real-valued, binary, counts, non-negative or a heterogeneous set
of these data types. The model unifies various existing models and constructs
for unsupervised settings, the complementary framework to the generalised
linear models in regression.

Moving to structural considerations, we
develop Bayesian methods for learning sparse latent representations. We
define ideas of weakly and strongly sparse vectors and investigate the
classes of prior distributions that give rise to these forms of sparsity,
namely the scale-mixture of Gaussians and the spike-and-slab distribution.
Based on these sparsity favouring priors, we develop and compare methods for
sparse matrix factorisation and present the first comparison of these sparse
learning approaches. As a second structural consideration, we develop models
with the ability to generate correlated binary vectors. Moment-matching is
used to allow binary data with specified correlation to be generated, based
on dichotomisation of the Gaussian distribution. We then develop a novel and
simple method for binary PCA based on Gaussian dichotomisation. The third
generalisation considers the extension of matrix factorisation models to
multi-dimensional arrays of data that are increasingly prevalent. We develop
the first Bayesian model for non-negative tensor factorisation and explore
the relationship between this model and the previously described models for
matrix factorisation.

**Large scale
non-parametric inference: Data parallelisation in the Indian buffet
process**.
In *Advances in Neural Information Processing Systems 23*, pages
1294-1302, Cambridge, MA, USA, December 2009. The MIT Press.

** Abstract:** Nonparametric Bayesian models provide a framework
for flexible probabilistic modelling of complex datasets. Unfortunately, the
high-dimensional averages required for Bayesian methods can be slow,
especially with the unbounded representations used by nonparametric models.
We address the challenge of scaling Bayesian inference to the increasingly
large datasets found in real-world applications. We focus on parallelisation
of inference in the Indian Buffet Process (IBP), which allows data points to
have an unbounded number of sparse latent features. Our novel MCMC sampler
divides a large data set between multiple processors and uses message passing
to compute the global likelihoods and posteriors. This algorithm, the first
parallel inference scheme for IBP-based models, scales to datasets orders of
magnitude larger than have previously been possible.

Shakir Mohamed, Katherine A. Heller, and Zoubin Ghahramani.
**Bayesian
exponential family PCA**.
In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, *Advances in
Neural Information Processing Systems 21*, pages 1089-1096, Cambridge,
MA, USA, December 2009. The MIT Press.

** Abstract:**
Principal Components Analysis (PCA) has become established as one of the key
tools for dimensionality reduction when dealing with real valued data.
Approaches such as exponential family PCA and non-negative matrix
factorisation have successfully extended PCA to non-Gaussian data types, but
these techniques fail to take advantage of Bayesian inference and can suffer
from problems of overfitting and poor generalisation. This paper presents a
fully probabilistic approach to PCA, which is generalised to the exponential
family, based on Hybrid Monte Carlo sampling. We describe the model which is
based on a factorisation of the observed data matrix, and show performance of
the model on both synthetic and real data.

** Comment:** spotlight.

Mikkel N. Schmidt and Shakir Mohamed.
**Probabilistic
non-negative tensor factorization using Markov chain Monte
Carlo**.
In *European Signal Processing Conference (EUSIPCO)*, pages 1918-1922,
Glasgow, Scotland, August 2009.

** Abstract:** We present a
probabilistic model for learning non-negative tensor factorizations (NTF), in
which the tensor factors are latent variables associated with each data
dimension. The non-negativity constraint for the latent factors is handled by
choosing priors with support on the non-negative numbers. Two Bayesian
inference procedures based on Markov chain Monte Carlo sampling are
described: Gibbs sampling and Hamiltonian Markov chain Monte Carlo. We
evaluate the model on two food science data sets, and show that the
probabilistic NTF model leads to better predictions and avoids overfitting
compared to existing NTF approaches.

** Comment:** Rated by reviewers amongst the top 5% of the
presented papers.

**Adapting the
linearised Laplace model evidence for modern deep learning**.
In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang
Niu, and Sivan Sabato, editors, *39th International Conference on Machine
Learning*, volume 162 of *Proceedings of Machine Learning
Research*, pages 796-821. PMLR, 2022.

** Abstract:**
The linearised Laplace method for estimating model uncertainty has received
renewed attention in the Bayesian deep learning community. The method
provides reliable error bars and admits a closed-form expression for the
model evidence, allowing for scalable selection of model hyperparameters. In
this work, we examine the assumptions behind this method, particularly in
conjunction with model selection. We show that these interact poorly with
some now-standard tools of deep learning (stochastic approximation methods
and normalisation layers) and make recommendations for how to better adapt
this classic method to the modern setting. We provide theoretical support for
our recommendations and validate them empirically on MLPs, classic CNNs,
residual networks with and without normalisation layers, generative
autoencoders and transformers.

**Bayesian deep
learning via subnetwork inference**.
In Marina Meila and Tong Zhang, editors, *38th International Conference on
Machine Learning*, volume 139 of *Proceedings of Machine Learning
Research*, pages 2510-2521. PMLR, 2021.

** Abstract:**
The Bayesian paradigm has the potential to solve core issues of deep neural
networks such as poor calibration and data inefficiency. Alas, scaling
Bayesian inference to large weight spaces often requires restrictive
approximations. In this work, we show that it suffices to perform inference
over a small subset of model weights in order to obtain accurate predictive
posteriors. The other weights are kept as point estimates. This subnetwork
inference framework enables us to use expressive, otherwise intractable,
posterior approximations over such subsets. In particular, we implement
subnetwork linearized Laplace: We first obtain a MAP estimate of all weights
and then infer a full-covariance Gaussian posterior over a subnetwork. We
propose a subnetwork selection strategy that aims to maximally preserve the
model's predictive uncertainty. Empirically, our approach is effective
compared to ensembles and less expressive posterior approximations over full
networks.

Eric Nalisnick, José Miguel Hernández-Lobato, and Padhraic Smyth.
**Dropout as a
structured shrinkage prior**.
In *36th International Conference on Machine Learning*, Long Beach, June
2019.

** Abstract:** Dropout regularization of deep neural
networks has been a mysterious yet effective tool to prevent overfitting.
Explanations for its success range from the prevention of co-adapted weights
to it being a form of cheap Bayesian inference. We propose a novel framework
for understanding multiplicative noise in neural networks, considering
continuous distributions as well as Bernoulli noise (i.e. dropout). We show
that multiplicative noise induces structured shrinkage priors on a network's
weights. We derive the equivalence through reparametrization properties of
scale mixtures and without invoking any approximations. Given the
equivalence, we then show that dropout's Monte Carlo training objective
approximates marginal MAP estimation. We leverage these insights to propose a
novel shrinkage framework for resnets, terming the prior 'automatic depth
determination' as it is the natural analog of automatic relevance
determination for network depth. Lastly, we investigate two inference
strategies that improve upon the aforementioned MAP approximation in
regression benchmarks.

Robert Pinsler, Jonathan Gordon, Eric Nalisnick, and José Miguel
Hernández-Lobato.
**Bayesian
batch active learning as sparse subset approximation**.
In *Advances in Neural Information Processing Systems 32*, 2019.

** Abstract:** Leveraging the wealth of unlabeled data produced
in recent years provides great potential for improving supervised models.
When the cost of acquiring labels is high, probabilistic active learning
methods can be used to greedily select the most informative data points to be
labeled. However, for many large-scale problems standard greedy procedures
become computationally infeasible and suffer from negligible model change. In
this paper, we introduce a novel Bayesian batch active learning approach that
mitigates these issues. Our approach is motivated by approximating the
complete data posterior of the model parameters. While naive batch
construction methods result in correlated queries, our algorithm produces
diverse batches that enable efficient active learning at scale. We derive
interpretable closed-form solutions akin to existing active learning
procedures for linear models, and generalize to arbitrary models using random
projections. We demonstrate the benefits of our approach on several
large-scale regression and classification tasks.

Alexandre Khae Wu Navarro, Jes Frellsen, and Richard E. Turner.
**The Multivariate Generalised
von Mises distribution: Inference and applications**.
In *31st AAAI Conference on Artificial Intelligence*, San Francisco, CA,
USA, January 2017. AAAI Press.

** Abstract:** Circular
variables arise in a multitude of data-modelling contexts ranging from
robotics to the social sciences, but they have been largely overlooked by the
machine learning community. This paper partially redresses this imbalance by
extending some standard probabilistic modelling tools to the circular domain.
First we introduce a new multivariate distribution over circular variables,
called the multivariate Generalised von Mises (mGvM) distribution. This
distribution can be constructed by restricting and renormalising a general
multivariate Gaussian distribution to the unit hyper-torus. Previously
proposed multivariate circular distributions are shown to be special cases of
this construction. Second, we introduce a new probabilistic model for
circular regression that is inspired by Gaussian Processes, and a method for
probabilistic principal component analysis with circular hidden variables.
These models can leverage standard modelling tools (e.g. covariance functions
and methods for automatic relevance determination). Third, we show that the
posterior distribution in these models is an mGvM distribution, which enables
development of an efficient variational free-energy scheme for performing
approximate inference and approximate maximum-likelihood learning.

Vincent Fortuin, Adrià Garriga-Alonso, Sebastian W. Ober, Florian Wenzel,
Gunnar Rätsch, Richard E. Turner, Mark van der Wilk, and Laurence
Aitchison.
**Bayesian neural network
priors revisited**.
In *10th International Conference on Learning Representations*, 2022.

** Abstract:** Isotropic Gaussian priors are the de facto
standard for modern Bayesian neural network inference. However, it is unclear
whether these priors accurately reflect our true beliefs about the weight
distributions or give optimal performance. To find better priors, we study
summary statistics of neural network weights in networks trained using
stochastic gradient descent (SGD). We find that convolutional neural network
(CNN) and ResNet weights display strong spatial correlations, while fully
connected networks (FCNNs) display heavy-tailed weight distributions. We show
that building these observations into priors can lead to improved performance
on a variety of image classification datasets. Surprisingly, these priors
mitigate the cold posterior effect in FCNNs, but slightly increase the cold
posterior effect in ResNets.

Henry B. Moss, Sebastian W. Ober, and Victor Picheny.
**Information-theoretic
inducing point placement for high-throughput Bayesian optimisation**.
In *ICML Workshop on Adaptive Experimental Design and Active Learning in the
Real World (RealML)*, 2022.

** Abstract:** Sparse Gaussian
Processes are a key component of high-throughput Bayesian optimisation (BO)
loops — an increasingly common setting where evaluation budgets are large
and highly parallelised. By using representative subsets of the available
data to build approximate posteriors, sparse models dramatically reduce the
computational costs of surrogate modelling by relying on a small set of
pseudo-observations, the so-called inducing points, in lieu of the full data
set. However, current approaches to design inducing points are not
appropriate within BO loops as they seek to reduce global uncertainty in the
objective function. Thus, the high-fidelity modelling of promising and
data-dense regions required for precise optimisation is sacrificed and
computational resources are instead wasted on modelling areas of the space
already known to be sub-optimal. Inspired by entropy-based BO methods, we
propose a novel inducing point design that uses a principled
information-theoretic criterion to select inducing points. By choosing
inducing points to maximally reduce both global uncertainty and uncertainty
in the maximum value of the objective function, we build surrogate models
able to support high-precision high-throughput BO.

Pola E. Schwöbel, Martin Jørgensen, Sebastian W. Ober, and Mark van der
Wilk.
**Last
layer marginal likelihood for invariance learning**.
In *25th International Conference on Artificial Intelligence and
Statistics*, 2022.

** Abstract:** Data augmentation is
often used to incorporate inductive biases into models. Traditionally, these
are hand-crafted and tuned with cross validation. The Bayesian paradigm for
model selection provides a path towards end-to-end learning of invariances
using only the training data, by optimising the marginal likelihood.
Computing the marginal likelihood is hard for neural networks, but success
with tractable approaches that compute the marginal likelihood for the last
layer only raises the question of whether this convenient approach might be
employed for learning invariances. We show partial success on standard
benchmarks, in the low-data regime and on a medical imaging dataset by
designing a custom optimisation routine. Introducing a new lower bound to the
marginal likelihood allows us to perform inference for a larger class of
likelihood functions than before. On the other hand, we demonstrate failure
modes on the CIFAR10 dataset, where the last layer approximation is not
sufficient due to the increased complexity of our neural network. Our results
indicate that once more sophisticated approximations become available the
marginal likelihood is a promising approach for invariance learning in neural
networks.

Laurence Aitchison, Adam X. Yang, and Sebastian W. Ober.
**Deep
kernel processes**.
In *38th International Conference on Machine Learning*, 2021.

** Abstract:** We define deep kernel processes in which positive
definite Gram matrices are progressively transformed by nonlinear kernel
functions and by sampling from (inverse) Wishart distributions. Remarkably,
we find that deep Gaussian processes (DGPs), Bayesian neural networks (BNNs),
infinite BNNs, and infinite BNNs with bottlenecks can all be written as deep
kernel processes. For DGPs the equivalence arises because the Gram matrix
formed by the inner product of features is Wishart distributed, and as we
show, standard isotropic kernels can be written entirely in terms of this
Gram matrix — we do not need knowledge of the underlying features. We
define a tractable deep kernel process, the deep inverse Wishart process, and
give a doubly-stochastic inducing-point variational inference scheme that
operates on the Gram matrices, not on the features, as in DGPs. We show that
the deep inverse Wishart process gives superior performance to DGPs and
infinite BNNs on fully-connected baselines.

David R. Burt, Sebastian W. Ober, Adrià Garriga-Alonso, and Mark van der
Wilk.
**Understanding
variational inference in function-space**.
In *3rd Symposium on Advances in Approximate Bayesian Inference*,
2021.

** Abstract:** Recent work has attempted to directly
approximate the ‘function-space’ or predictive posterior distribution of
Bayesian models, without approximating the posterior distribution over the
parameters. This is appealing in e.g. Bayesian neural networks, where we only
need the former, and the latter is hard to represent. In this work, we
highlight some advantages and limitations of employing the Kullback-Leibler
divergence in this setting. For example, we show that minimizing the KL
divergence between a wide class of parametric distributions and the posterior
induced by a (non-degenerate) Gaussian process prior leads to an ill-defined
objective function. Then, we propose (featurized) Bayesian linear regression
as a benchmark for ‘function-space’ inference methods that directly
measures approximation quality. We apply this methodology to assess aspects
of the objective function and inference scheme considered in Sun et al.
(2018), emphasizing the quality of approximation to Bayesian inference as
opposed to predictive performance.
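
To make concrete why Bayesian linear regression can serve as a reference point, the sketch below computes its exact posterior predictive in two ways: through the weights, and directly in function space through the induced kernel. The two agree, which is the exact answer an approximate function-space method could be compared against. This is a minimal sketch under assumed settings (a hypothetical polynomial feature map, prior variance and noise variance chosen arbitrarily), not the authors' benchmark code.

```python
import numpy as np

def features(x):
    # Hypothetical feature map (an assumption): polynomial features [1, x, x^2].
    return np.stack([np.ones_like(x), x, x**2], axis=1)

def weight_space_predictive(x_train, y_train, x_test, prior_var=1.0, noise_var=0.1):
    """Exact Gaussian posterior over weights, pushed through the features."""
    phi, phi_test = features(x_train), features(x_test)
    precision = phi.T @ phi / noise_var + np.eye(phi.shape[1]) / prior_var
    cov_w = np.linalg.inv(precision)
    mean_w = cov_w @ phi.T @ y_train / noise_var
    return phi_test @ mean_w, phi_test @ cov_w @ phi_test.T

def function_space_predictive(x_train, y_train, x_test, prior_var=1.0, noise_var=0.1):
    """Same posterior via the kernel k(x, x') = prior_var * phi(x)^T phi(x')."""
    phi, phi_test = features(x_train), features(x_test)
    k_train = prior_var * phi @ phi.T + noise_var * np.eye(len(x_train))
    k_cross = prior_var * phi_test @ phi.T
    k_test = prior_var * phi_test @ phi_test.T
    mean = k_cross @ np.linalg.solve(k_train, y_train)
    cov = k_test - k_cross @ np.linalg.solve(k_train, k_cross.T)
    return mean, cov

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
x_train = rng.uniform(-1.0, 1.0, size=20)
y_train = 0.5 * x_train**2 - x_train + rng.normal(scale=0.3, size=20)
x_test = np.linspace(-1.0, 1.0, 5)

mean_w, cov_w = weight_space_predictive(x_train, y_train, x_test)
mean_f, cov_f = function_space_predictive(x_train, y_train, x_test)
print(np.max(np.abs(mean_w - mean_f)), np.max(np.abs(cov_w - cov_f)))
```

Both functions return the mean and covariance of the noise-free function values at the test inputs, so any gap between an approximate method and these values measures approximation error rather than predictive performance.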

Sebastian W. Ober and Laurence Aitchison.
**Global
inducing point variational posteriors for Bayesian neural networks and deep
Gaussian processes**.
In *38th International Conference on Machine Learning*, 2021.

** Abstract:** We consider the optimal approximate posterior
over the top-layer weights in a Bayesian neural network for regression, and
show that it exhibits strong dependencies on the lower-layer weights. We
adapt this result to develop a correlated approximate posterior over the
weights at all layers in a Bayesian neural network. We extend this approach
to deep Gaussian processes, unifying inference in the two model classes. Our
approximate posterior uses learned "global" inducing points, which are
defined only at the input layer and propagated through the network to obtain
inducing inputs at subsequent layers. By contrast, standard "local"
inducing point methods from the deep Gaussian process literature optimise a
separate set of inducing inputs at every layer, and thus do not model
correlations across layers. Our method gives state-of-the-art performance for
a variational Bayesian method, without data augmentation or tempering, on
CIFAR-10 of 86.7%, which is comparable to SGMCM