Department of Engineering

University of Cambridge

[ Bratières | Cunningham | Davies | Duvenaud | Eaton | Frigola | Van Gael | Gal | Ghahramani | Heaukulani | Heller | Hernández-Lobato | Houlsby | Huszár | Knowles | Lacoste-Julien | Lloyd | Lopez-Paz | McHutchon | Mohamed | Orbanz | Ortega | Palla | Quadrianto | Rasmussen | Roy | Saatçi | Rich Turner | Ryan Turner | Snelson | Williamson | Wilson ] |

[ 2014 | 2013 | 2012 | 2011 | 2010 | 2009 | 2008 | 2007 | 2006 | 2005 | 2004 | 2003 | 2002 | 2001 | past millenea ] |

David Duvenaud, Oren Rippel, Ryan P. Adams, and Zoubin Ghahramani.
**Avoiding pathologies in very
deep networks**.
In *17th International Conference on Artificial Intelligence and
Statistics*, Reykjavik, Iceland, April 2014.

**
Abstract:** Choosing appropriate architectures and regularization
strategies for deep networks is crucial to good predictive performance. To
shed light on this problem, we analyze the analogous problem of constructing
useful priors on compositions of functions. Specifically, we study the deep
Gaussian process, a type of infinitely-wide, deep neural network. We show
that in standard architectures, the representational capacity of the network
tends to capture fewer degrees of freedom as the number of layers increases,
retaining only a single degree of freedom in the limit. We propose an
alternate network architecture which does not suffer from this pathology. We
also examine deep covariance functions, obtained by composing infinitely many
feature transforms. Lastly, we characterize the class of models obtained by
performing dropout on Gaussian processes.

Roger Frigola, Fredrik Lindsten, Thomas B. Schön, and Carl Edward
Rasmussen.
**Bayesian inference and learning in
Gaussian process state-space models with particle MCMC**.
In L. Bottou, C.J.C. Burges, Z. Ghahramani, M. Welling, and K.Q. Weinberger,
editors, *Advances in Neural Information Processing Systems 26*.
Curran Associates, Inc., 2014.

** Abstract:** State-space
models are successfully used in many areas of science, engineering and
economics to model time series and dynamical systems. We present a fully
Bayesian approach to inference and learning in nonlinear nonparametric
state-space models. We place a Gaussian process prior over the transition
dynamics, resulting in a flexible model able to capture complex dynamical
phenomena. However, to enable efficient inference, we marginalize over the
dynamics of the model and instead infer directly the joint smoothing
distribution through the use of specially tailored Particle Markov Chain
Monte Carlo samplers. Once a sample from the smoothing distribution is
computed, the state transition predictive distribution can be formulated
analytically. We make use of sparse Gaussian process models to greatly reduce
the computational complexity of the approach.

James Robert Lloyd, David Duvenaud, Roger Grosse, Joshua B. Tenenbaum, and
Zoubin Ghahramani.
**Automatic construction and
natural-language description of nonparametric regression models**.
Technical Report arXiv:1402.4304 [stat.ML], February 2014.

**
Abstract:** This paper presents the beginnings of an automatic
statistician, focusing on regression problems. Our system explores an
open-ended space of possible statistical models to discover a good
explanation of the data, and then produces a detailed report with figures and
natural-language text. Our approach treats unknown functions
nonparametrically using Gaussian processes, which has two important
consequences. First, Gaussian processes model functions in terms of
high-level properties (e.g. smoothness, trends, periodicity, changepoints).
Taken together with the compositional structure of our language of models,
this allows us to automatically describe functions through a decomposition
into additive parts. Second, the use of flexible nonparametric models and a
rich language for composing them in an open-ended manner also results in
state-of-the-art extrapolation performance evaluated over 13 real time series
data sets from various domains.

Marc Peter Deisenroth, Dieter Fox, and Carl Edward Rasmussen.
**Gaussian processes for data-efficient learning in robotics and control**.
*IEEE Transactions on Pattern Analysis and Machine Intelligence*, 2013,
doi
10.1109/TPAMI.2013.218.

** Abstract:** Autonomous learning
has been a promising direction in control and robotics for more than a decade
since data-driven learning allows to reduce the amount of engineering
knowledge, which is otherwise required. However, autonomous reinforcement
learning (RL) approaches typically require many interactions with the system
to learn controllers, which is a practical limitation in real systems, such
as robots, where many interactions can be impractical and time consuming. To
address this problem, current learning approaches typically require
task-speciﬁc knowledge in form of expert demonstrations, realistic
simulators, pre-shaped policies, or speciﬁc knowledge about the underlying
dynamics. In this article, we follow a different approach and speed up
learning by extracting more information from data. In particular, we learn a
probabilistic, non-parametric Gaussian process transition model of the
system. By explicitly incorporating model uncertainty into long-term planning
and controller learning our approach reduces the effects of model errors, a
key problem in model-based learning. Compared to state-of-the art RL our
model-based policy search method achieves an unprecedented speed of learning.
We demonstrate its applicability to autonomous learning in real robot and
control tasks.

** Comment:** accepted

David Duvenaud, James Robert Lloyd, Roger Grosse, Joshua B. Tenenbaum, and
Zoubin Ghahramani.
**Structure
discovery in nonparametric regression through compositional kernel
search**.
In *30th International Conference on Machine Learning*, Atlanta,
Georgia, USA, June 2013.

** Abstract:** Despite its
importance, choosing the structural form of the kernel in nonparametric
regression remains a black art. We define a space of kernel structures which
are built compositionally by adding and multiplying a small number of base
kernels. We present a method for searching over this space of structures
which mirrors the scientific discovery process. The learned structures can
often decompose functions into interpretable components and enable long-range
extrapolation on time-series datasets. Our structure search method
outperforms many widely used kernels and kernel combination methods on a
variety of prediction tasks.

Roger Frigola and Carl Edward Rasmussen.
**Integrated pre-processing for
Bayesian nonlinear system identification with Gaussian processes**.
In *Decision and Control (CDC), 2013 IEEE 52nd Annual Conference on*,
2013.

** Abstract:** We introduce GP-FNARX: a new model for
nonlinear system identification based on a nonlinear autoregressive exogenous
model (NARX) with filtered regressors (F) where the nonlinear regression
problem is tackled using sparse Gaussian processes (GP). We integrate data
pre-processing with system identification into a fully automated procedure
that goes from raw data to an identified model. Both pre-processing
parameters and GP hyper-parameters are tuned by maximizing the marginal
likelihood of the probabilistic model. We obtain a Bayesian model of the
system's dynamics which is able to report its uncertainty in regions where
the data is scarce. The automated approach, the modeling of uncertainty and
its relatively low computational cost make of GP-FNARX a good candidate for
applications in robotics and adaptive control.

Yarin Gal and Phil Blunsom.
**A systematic
Bayesian treatment of the IBM alignment models**.
In *Proceedings of the 2013 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies*.
Association for Computational Linguistics, 2013.

**
Abstract:** The dominant yet ageing IBM and HMM word alignment models
underpin most popular Statistical Machine Translation implementations in use
today. Though beset by the limitations of implausible independence
assumptions, intractable optimisation problems, and an excess of tunable
parameters, these models provide a scalable and reliable starting point for
inducing translation systems. In this paper we build upon this venerable base
by recasting these models in the non-parametric Bayesian framework. By
replacing the categorical distributions at their core with hierarchical
Pitman-Yor processes, and through the use of collapsed Gibbs sampling, we
provide a more flexible formulation and sidestep the original heuristic
optimisation techniques. The resulting models are highly extendible,
naturally permitting the introduction of phrasal dependencies. We present
extensive experimental results showing improvements in both AER and BLEU when
benchmarked against Giza++, including significant improvements over IBM model
4.

E. Gilboa, Yunus Saatçi, and John P. Cunningham.
**Scaling
multidimensional Gaussian processes using projected additive
approximations**.
In *30th International Conference on Machine Learning*, 2013.

** Abstract:** Exact Gaussian Process (GP) regression has
O(N^{3}) runtime for data size N, making it intractable for large N.
Many algorithms for improving GP scaling approximate the covariance with
lower rank matrices. Other work has exploited structure inherent in
particular covariance functions, including GPs with implied Markov structure,
and equispaced inputs (both enable O(N) runtime). However, these GP advances
have not been extended to the multidimensional input setting, despite the
preponderance of multidimensional applications. This paper introduces and
tests novel extensions of structured GPs to multidimensional inputs. We
present new methods for additive GPs, showing a novel connection between the
classic backﬁtting method and the Bayesian framework. To achieve optimal
accuracy-complexity tradeoff, we extend this model with a novel variant of
projection pursuit regression. Our primary result – projection pursuit
Gaussian Process Regression – shows orders of magnitude speedup while
preserving high accuracy. The natural second and third steps include
non-Gaussian observations and higher dimensional equispaced grid methods. We
introduce novel techniques to address both of these necessary directions. We
thoroughly illustrate the power of these three advances on several datasets,
achieving close performance to the naive Full GP at orders of magnitude less
cost.

E. Gilboa, Yunus Saatçi, and John P. Cunningham.
**Scaling multidimensional inference for structured Gaussian processes**.
*IEEE Transactions on Pattern Analysis and Machine Intelligence*, 2013.

** Comment:** to appear

Creighton Heaukulani and Zoubin Ghahramani.
**Dynamic probabilistic models for latent feature propagation in social
networks**.
In *30th International Conference on Machine Learning*, Atlanta,
Georgia, USA, June 2013.

** Abstract:** Current Bayesian
models for dynamic social network data have focused on modelling the
influence of evolving unobserved structure on observed social interactions.
However, an understanding of how observed social relationships from the past
affect future unobserved structure in the network has been neglected. In this
paper, we introduce a new probabilistic model for capturing this phenomenon,
which we call latent feature propagation, in social networks. We demonstrate
our model's capability for inferring such latent structure in varying types
of social network datasets, and experimental studies show this structure
achieves higher predictive performance on link prediction and forecasting
tasks.

Daniel Hernández-Lobato and José Miguel Hernández-Lobato.
**Learning feature selection
dependencies in multi-task learning**.
In *Advances in Neural Information Processing Systems 27*, pages 1-9,
Lake Tahoe, California, USA, December 2013.

** Abstract:** A
probabilistic model based on the horseshoe prior is proposed for learning de-
pendencies in the process of identifying relevant features for prediction.
Exact inference is intractable in this model. However, expectation
propagation offers an approximate alternative. Because the process of
estimating feature selection dependencies may suffer from over-fitting in the
model proposed, additional data from a multi-task learning scenario are
considered for induction. The same model can be used in this setting with few
modifications. Furthermore, the assumptions made are less restrictive than in
other multi-task methods: The different tasks must share feature selection
dependencies, but can have different relevant features and model
coefficients. Experiments with real and synthetic data show that this model
performs better than other multi-task alternatives from the literature. The
experiments also show that the model is able to induce suitable feature
selection dependencies for the problems considered, only from the training
data.

Daniel Hernández-Lobato, José Miguel Hernández-Lobato, and Pierre Dupont.
**Generalized spike-and-slab priors for Bayesian group feature
selection using expectation propagation**.
*Journal of Machine Learning Research*, 14:1891-1945, July 2013.

** Abstract:** We describe a Bayesian method for group feature
selection in linear regression problems. The method is based on a generalized
version of the standard spike-and-slab prior distribution which is often used
for individual feature selection. Exact Bayesian inference under the prior
considered is infeasible for typical regression problems. However,
approximate inference can be carried out efficiently using Expectation
Propagation (EP). A detailed analysis of the generalized spike-and-slab
prior shows that it is well suited for regression problems that are sparse at
the group level. Furthermore, this prior can be used to introduce prior
knowledge about specific groups of features that are a priori believed to be
more relevant. An experimental evaluation compares the performance of the
proposed method with those of group LASSO, Bayesian group LASSO,
automatic relevance determination and additional variants used for group
feature selection. The results of these experiments show that a model based
on the generalized spike-and-slab prior and the EP algorithm has
state-of-the-art prediction performance in the problems analyzed.
Furthermore, this model is also very useful to carry out sequential
experimental design (also known as active learning), where the data instances
that are most informative are iteratively included in the training set,
reducing the number of instances needed to obtain a particular level of
prediction accuracy.

José Miguel Hernández-Lobato, Neil Houlsby, and Zoubin Ghahramani.
**Stochastic inference for scalable probabilistic modeling of binary
matrices**.
In *NIPS Workshop on Randomized Methods for Machine Learning*, 2013.

** Abstract:** Fully observed large binary matrices appear in a
wide variety of contexts. To model them, probabilistic matrix factorization
(PMF) methods are an attractive solution. However, current batch algorithms
for PMF can be inefficient since they need to analyze the entire data matrix
before producing any parameter updates. We derive an efficient stochastic
inference algorithm for PMF models of fully observed binary matrices. Our
method exhibits faster convergence rates than more expensive batch approaches
and has better predictive performance than scalable alternatives. The
proposed method includes new data subsampling strategies which produce large
gains over standard uniform subsampling. We also address the task of
automatically selecting the size of the minibatches of data and we propose an
algorithm that adjusts this hyper-parameter in an online manner.

José Miguel Hernández-Lobato, James Robert Lloyds, and Daniel
Hernández-Lobato.
**Gaussian process
conditional copulas with applications to financial time series**.
In *Advances in Neural Information Processing Systems 27*, pages 1-9,
Lake Tahoe, California, USA, December 2013.

** Abstract:** The
estimation of dependencies between multiple variables is a central problem in
the analysis of financial time series. A common approach is to express these
dependencies in terms of a copula function. Typically the copula function is
assumed to be constant but this may be inaccurate when there are covariates
that could have a large influence on the dependence structure of the data. To
account for this, a Bayesian framework for the estimation of conditional
copulas is proposed. In this framework the parameters of a copula are
non-linearly related to some arbitrary conditioning variables. We evaluate
the ability of our method to predict time-varying dependencies on several
equities and currencies and observe consistent performance gains compared to
static copula models and other time-varying copula methods.

Neil Houlsby and Guy Houlsby.
**Statistical fitting of undrained strength data**.
*Geotechnique*, 63(13):1253-1263, 2013, doi
10.1680/geot.13.P.007.

** Abstract:** We describe an
approach, based on Bayesian statistical methods, that allows the fitting of a
design profile to a set of measurements of undrained strengths. In particular
we allow for the automatic determination of not only the positions of
boundaries between geological units, but also the selection of the number of
units to model the data in an appropriate way.

Neil Houlsby, Ferenc Huszár, Mohammad M Ghassemi, Gergő Orbán,
Daniel M Wolpert, and Máté Lengyel.
**Cognitive tomography reveals complex task-independent mental
representations**.
*Current Biology*, 23(21):2169-2175, 2013, doi
10.1016/j.cub.2013.09.012.

** Abstract:** Humans develop
rich mental representations that guide their behavior in a variety of
every-day tasks. However, it is unknown whether these representations, often
formalized as priors in Bayesian inference, are specific for each task or
subserve multiple tasks. Current approaches cannot distinguish between these
two possibilities because they cannot extract comparable representations
across different tasks. Here, we develop a novel method, termed cognitive
tomography, that can extract complex, multi-dimensional priors across tasks.
We apply this method to human judgments in two qualitatively different tasks,
familiarity and odd-one-out, involving an ecologically relevant set of
stimuli, human faces. We show that priors over faces are structurally complex
and vary dramatically across subjects, but are invariant across the tasks
within each subject. The priors we extract from each task allow us to predict
with high precision the behavior of subjects for novel stimuli both in the
same task as well as in the other task. Our results provide the first
evidence for a single high-dimensional structured representation of a
naturalistic stimulus set that guides behavior in multiple tasks. Moreover,
the representations estimated by cognitive tomography can provide
independent, behavior-based regressors for elucidating the neural correlates
of complex naturalistic priors.

Tomoharu Iwata, David Duvenaud, and Zoubin Ghahramani.
**Warped mixtures for nonparametric
cluster shapes**.
In *29th Conference on Uncertainty in Artificial Intelligence*,
Bellevue, Washington, July 2013.

** Abstract:** A mixture of
Gaussians fit to a single curved or heavy-tailed cluster will report that the
data contains many clusters. To produce more appropriate clusterings, we
introduce a model which warps a latent mixture of Gaussians to produce
nonparametric cluster shapes. The possibly low-dimensional latent mixture
model allows us to summarize the properties of the high-dimensional clusters
(or density manifolds) describing the data. The number of manifolds, as well
as the shape and dimension of each manifold is automatically inferred. We
derive a simple inference scheme for this model which analytically integrates
out both the mixture parameters and the warping function. We show that our
model is effective for density estimation, performs better than infinite
Gaussian mixture models at recovering the true number of clusters, and
produces interpretable summaries of high-dimensional datasets.

Tomoharu Iwata, Neil Houlsby, and Zoubin Ghahramani.
**Active
learning for interactive visualization**.
In *16th International Conference on Artificial Intelligence and
Statistics*, 2013.

** Abstract:** Many automatic
visualization methods have been proposed. However, a visualization that is
automatically generated might be different to how a user wants to arrange the
objects in visualization space. By allowing users to re-locate objects in the
embedding space of the visualization, they can adjust the visualization to
their preference. We propose an active learning framework for interactive
visualization which selects objects for the user to re-locate so that they
can obtain their desired visualization by re-locating as few as possible. The
framework is based on an information theoretic criterion, which favors
objects that reduce the uncertainty of the visualization. We present a
concrete application of the proposed framework to the Laplacian eigenmap
visualization method. We demonstrate experimentally that the proposed
framework yields the desired visualization with fewer user interactions than
existing methods.

Michael Kaschesky, Pawel Sobkowicz, José Miguel Hernández-Lobato, Guillaume
Bouchard, Cedric Archambeau, Nicolas Scharioth, Robert Manchin, Adrian
Gschwend, and Reinhard Riedl.
**Bringing representativeness into social media monitoring and
analysis**.
In *46th Hawaii International Conference on System Sciences*, Manoa,
Hawaii, 2013.

** Abstract:** The opinions, expectations and
behavior of citizens are increasingly reflected online - therefore, mining
the internet for such data can enhance decision-making in public policy,
communications, marketing, finance and other fields. However, to come closer
to the representativeness of classic opinion surveys there is a lack of
knowledge about the sociodemographic characteristics of those voicing
opinions on the internet. This paper proposes to calibrate online opinions
aggregated from multiple and heterogeneous data sources with traditional
surveys enhanced with rich socio-demographic information to enable insights
into which opinions are expressed on the internet by specific segments of
society. The goal of this research is to provide professionals in citizen-
and consumer- centered domains with more concise near real-time intelligence
on online opinions. To become effective, the methodologies presented in this
paper must be integrated into a coherent decision support system.

Konstantin Kravtsov, Stanislav Straupe, Igor Radchenko, Neil Houlsby, Ferenc
Huszár, and Sergey Kulik.
**Experimental
adaptive Bayesian tomography**.
*Physical Review A*, 87(6):062122, 2013.

** Abstract:**
We report an experimental realization of an adaptive quantum state tomography
protocol. Our method takes advantage of a Bayesian approach to statistical
inference and is naturally tailored for adaptive strategies. For pure states
we observe close to $N^-1$ scaling of infidelity with overall number of
registered events, while best non-adaptive protocols allow for $N^-1/2$
scaling only. Experiments are performed for polarization qubits, but the
approach is readily adapted to any dimension.

James Robert Lloyd.
**Gefcom2012
hierarchical load forecasting: Gradient boosting machines and gaussian
processes**.
*International Journal of Forecasting*, 2013.

**
Abstract:** This report discusses methods for forecasting hourly loads of a
US utility as part of the load forecasting track of the Global Energy
Forecasting Competition 2012 hosted on Kaggle. The methods described
(gradient boosting machines and Gaussian processes) are generic machine
learning / regression algorithms and few domain specific adjustments were
made. Despite this, the algorithms were able to produce highly competitive
predictions and hopefully they can inspire more reﬁned techniques to
compete with state-of-the-art load forecasting methodologies.

David Lopez-Paz, Philipp Hennig, and Bernhard Scholköpf.
**The randomized dependence
coefficient**.
In *Advances in Neural Information Processing Systems 27*, pages 1-9,
Lake Tahoe, California, USA, December 2013.

** Abstract:** We
introduce the Randomized Dependence Coefficient (RDC), a measure of
non-linear dependence between random variables of arbitrary dimension based
on the Hirschfeld-Gebelein-Rényi Maximum Correlation Coefficient. RDC
is defined in terms of correlation of random non-linear copula projections;
it is invariant with respect to marginal distribution transformations, has
low computational cost and is easy to implement: just five lines of R code,
included at the end of the paper.

David Lopez-Paz, José Miguel Hernández-Lobato, and Zoubin Ghahramani.
**Gaussian
process vine copulas for multivariate dependence**.
In *30th International Conference on Machine Learning*, Atlanta,
Georgia, USA, June 2013.

** Abstract:** Copulas allow to learn
marginal distributions separately from the multivariate dependence structure
(copula) that links them together into a density function. Vine
factorizations ease the learning of high-dimensional copulas by constructing
a hierarchy of conditional bivariate copulas. However, to simplify inference,
it is common to assume that each of these conditional bivariate copulas is
independent from its conditioning variables. In this paper, we relax this
assumption by discovering the latent functions that specify the shape of a
conditional copula given its conditioning variables We learn these functions
by following a Bayesian approach based on sparse Gaussian processes with
expectation propagation for scalable, approximate inference. Experiments on
real-world datasets show that, when modeling all conditional dependencies, we
obtain better estimates of the underlying copula of the data.

Michael A. Osborne, David Duvenaud, Roman Garnett, Carl Edward Rasmussen,
Stephen J. Roberts, and Zoubin Ghahramani.
**Active
learning of model evidence using Bayesian quadrature**.
In *Advances in Neural Information Processing Systems 25*, pages 46-54,
Lake Tahoe, California, USA, December 2013.

** Abstract:**
Numerical integration is a key component of many problems in scientiﬁc
computing, statistical modelling, and machine learning. Bayesian Quadrature
is a model-based method for numerical integration which, relative to standard
Monte Carlo methods, offers increased sample efficiency and a more robust
estimate of the uncertainty in the estimated integral. We propose a novel
Bayesian Quadrature approach for numerical integration when the integrand is
non-negative, such as the case of computing the marginal likelihood,
predictive distribution, or normalising constant of a probabilistic model.
Our approach approximately marginalises the quadrature model’s
hyperparameters in closed form, and introduces an active learning scheme to
optimally select function evaluations, as opposed to using Monte Carlo
samples. We demonstrate our method on both a number of synthetic benchmarks
and a real scientiﬁc problem from astronomy.

Novi Quadrianto, Viktoriia Sharmanska, David A. Knowles, and Zoubin Ghahramani.
**The supervised
ibp: Neighbourhood preserving infinite latent feature models**.
In *29th Conference on Uncertainty in Artificial Intelligence*,
Bellevue, USA, July 2013.

** Abstract:** We propose a
probabilistic model to infer supervised latent variables in the Hamming space
from observed data. Our model allows simultaneous inference of the number of
binary latent variables, and their values. The latent variables preserve
neighbourhood structure of the data in a sense that objects in the same
semantic concept have similar latent values, and objects in different
concepts have dissimilar latent values. We formulate the supervised infinite
latent variable problem based on an intuitive principle of pulling objects
together if they are of the same type, and pushing them apart if they are
not. We then combine this principle with a flexible Indian Buffet Process
prior on the latent variables. We show that the inferred supervised latent
variables can be directly used to perform a nearest neighbour search for the
purpose of retrieval. We introduce a new application of dynamically extending
hash codes, and show how to effectively couple the structure of the hash
codes with continuously growing structure of the neighbourhood preserving
infinite latent feature space.

Viktoriia Sharmanska, Novi Quadrianto, and Christoph Lampert.
**Learning to rank
using privileged information**.
In *International Conference on Computer Vision*, 2013.

**
Abstract:** Many computer vision problems have an asymmetric distribution
of information between training and test time. In this work, we study the
case where we are given additional information about the training data, which
however will not be available at test time. This situation is called learning
using privileged information (LUPI). We introduce two maximum-margin
techniques that are able to make use of this additional source of
information, and we show that the framework is applicable to several
scenarios that have been studied in computer vision before. Experiments with
attributes, bounding boxes, image tags and rationales as additional
information in object classification show promising results.

Andrew Gordon Wilson and Ryan Prescott Adams.
**Gaussian process covariance
kernels for pattern discovery and extrapolation**.
Technical Report arXiv:1302.4245 [stat.ML], Department of Engineering,
University of Cambridge, Cambridge, UK, February 18 2013.

**
Abstract:** Gaussian processes are rich distributions over functions, which
provide a Bayesian nonparametric approach to smoothing and interpolation. We
introduce simple closed form kernels that can be used with Gaussian processes
to discover patterns and enable extrapolation. These kernels are derived by
modelling a spectral density - the Fourier transform of a kernel - with a
Gaussian mixture. The proposed kernels support a broad class of stationary
covariances, but Gaussian process inference remains simple and analytic. We
demonstrate the proposed kernels by discovering patterns and performing long
range extrapolation on synthetic examples, as well as atmospheric CO2 trends
and airline passenger data. We also show that we can reconstruct standard
covariances within our framework.

** Comment:** arXiv:1302.4245

Yue Wu, José Miguel Hernández-Lobato, and Zoubin Ghahramani.
**Dynamic covariance
models for multivariate financial time series**.
In *30th International Conference on Machine Learning*, Atlanta,
Georgia, USA, June 2013.

** Abstract:** The accurate
prediction of time-changing covariances is an important problem in the
modeling of multivariate financial data. However, some of the most popular
models suffer from a) overfitting problems and multiple local optima, b)
failure to capture shifts in market conditions and c) large computational
costs. To address these problems we introduce a novel dynamic model for
time-changing covariances. Over-fitting and local optima are avoided by
following a Bayesian approach instead of computing point estimates. Changes
in market conditions are captured by assuming a diffusion process in
parameter values, and finally computationally efficient and scalable
inference is performed using particle filters. Experiments with financial
data show excellent performance of the proposed method with respect to
current standard models.

A. Bahramisharif, M. A. J. van Gerven, J-M. Schoffelen, Z. Ghahramani, and
T. Heskes.
**The dynamic
beamformer**.
In G. Langs et al, editor, *Machine Learning in Interpretation of
Neuroimaging (MLINI) 2011 LNAI 7263*, pages 148-155, 2012.

** Abstract:** Beamforming is one of the most commonly used
methods for estimating the active neural sources from the MEG or EEG sensor
readings. The basic assumption in beamforming is that the sources are
uncorrelated, which allows for estimating each source independent of the
others. In this paper, we incorporate the independence assumption of the
standard beamformer in a linear dynamical system, thereby introducing the
dynamic beamformer. Using empirical data, we show that the dynamic beamformer
outperforms the standard beamformer in predicting the condition of interest
which strongly suggests that it also outperforms the standard method in
localizing the active neural generators.

John P. Cunningham, Zoubin Ghahramani, and Carl E. Rasmussen.
**Gaussian
Processes for time-marked time-series data**.
In *15th International Conference on Artificial Intelligence and
Statistics*, 2012.

** Abstract:** In many settings, data
is collected as multiple time series, where each recorded time series is an
observation of some underlying dynamical process of interest. These
observations are often time-marked with known event times, and one desires to
do a range of standard analyses. When there is only one time marker, one
simply aligns the observations temporally on that marker. When multiple
time-markers are present and are at different times on different time series
observations, these analyses are more difficult. We describe a Gaussian
Process model for analyzing multiple time series with multiple time markings,
and we test it on a variety of data.

Marc Peter Deisenroth, Ryan D. Turner, Marco F. Huber, Uwe D. Hanebeck, and
Carl Edward Rasmussen.
**Robust
filtering and smoothing with Gaussian processes**.
*IEEE Transactions on Automatic Control*, 57(7):1865-1871, 2012.

** Abstract:** We propose a principled algorithm for robust
Bayesian filtering and smoothing in nonlinear stochastic dynamic systems when
both the transition function and the measurement function are described by
nonparametric Gaussian process (GP) models. GPs are gaining increasing
importance in signal processing, machine learning, robotics, and control for
representing unknown system functions by posterior probability distributions.
This modern way of "system identification" is more robust than finding point
estimates of a parametric function representation. Our principled
filtering/smoothing approach for GP dynamic systems is based on analytic
moment matching in the context of the forward-backward algorithm. Our
numerical evaluations demonstrate the robustness of the proposed approach in
situations where other state-of-the-art Gaussian filters and smoothers can
fail.

David Duvenaud, Hannes Nickisch, and Carl Edward Rasmussen.
**Additive Gaussian processes**.
In *Advances in Neural Information Processing Systems 24*, pages
226-234, Granada, Spain, December 2012.

** Abstract:** We
introduce a Gaussian process model of functions which are additive. An
additive function is one which decomposes into a sum of low-dimensional
functions, each depending on only a subset of the input variables. Additive
GPs generalize both Generalized Additive Models, and the standard GP models
which use squared-exponential kernels. Hyperparameter learning in this model
can be seen as Bayesian Hierarchical Kernel Learning (HKL). We introduce an
expressive but tractable parameterization of the kernel function, which
allows efficient evaluation of all input interaction terms, whose number is
exponential in the input dimension. The additional structure discoverable by
this model results in increased interpretability, as well as state-of-the-art
predictive power in regression tasks.

Zoubin Ghahramani.
**Bayesian
nonparametrics and the probabilistic approach to modelling**.
*Philosophical Transactions of the Royal Society A*, 2012.

** Abstract:** Modelling is fundamental to many fields of
science and engineering. A model can be thought of as a representation of
possible data one could predict from a system. The probabilistic approach to
modelling uses probability theory to express all aspects of uncertainty in
the model. The probabilistic approach is synonymous with Bayesian modelling,
which simply uses the rules of probability theory in order to make
predictions, compare alternative models, and learn model parameters and
structure from data. This simple and elegant framework is most powerful when
coupled with flexible probabilistic models. Flexibility is achieved through
the use of Bayesian nonparametrics. This article provides an overview of
probabilistic modelling and an accessible survey of some of the main tools in
Bayesian nonparametrics. The survey covers the use of Bayesian nonparametrics
for modelling unknown functions, density estimation, clustering, time series
modelling, and representing sparsity, hierarchies, and covariance structure.
More specifically it gives brief non-technical overviews of Gaussian
processes, Dirichlet processes, infinite hidden Markov models, Indian buffet
processes, Kingman's coalescent, Dirichlet diffusion tress, and Wishart
processes.

Joseph Hall, Carl Edward Rasmussen, and Jan Maciejowski.
**Modelling and
control of nonlinear systems using Gaussian processes with partial model
information**.
In *51st IEEE Conference on Decision and Control*, 2012.

**
Abstract:** Gaussian processes are gaining increasing popularity among the
control community, in particular for the modelling of discrete time state
space systems. However, it has not been clear how to incorporate model
information, in the form of known state relationships, when using a Gaussian
process as a predictive model. An obvious example of known prior information
is position and velocity related states. Incorporation of such information
would be beneficial both computationally and for faster dynamics learning.
This paper introduces a method of achieving this, yielding faster dynamics
learning and a reduction in computational effort from O(Dn^{2}) to
O((D-F)n^{2}) in the prediction stage for a system with D states, F
known state relationships and n observations. The effectiveness of the method
is demonstrated through its inclusion in the PILCO learning algorithm with
application to the swing-up and balance of a torque-limited pendulum and the
balancing of a robotic unicycle in simulation.

Neil Houlsby, Jose Miguel Hernández-Lobato, Ferenc Huszár, and Zoubin
Ghahramani.
**Collaborative Gaussian processes for preference learning**.
In *Advances in Neural Information Processing Systems 26*, pages
2096-2104. Curran Associates, Inc., 2012.

** Abstract:** We
present a new model based on Gaussian processes (GPs) for learning pairwise
preferences expressed by multiple users. Inference is simplified by using a
*preference kernel* for GPs which allows us to combine supervised GP
learning of user preferences with unsupervised dimensionality reduction for
multi-user systems. The model not only exploits collaborative information
from the shared structure in user behavior, but may also incorporate user
features if they are available. Approximate inference is implemented using a
combination of expectation propagation and variational Bayes. Finally, we
present an efficient active learning strategy for querying preferences. The
proposed technique performs favorably on real-world data against
state-of-the-art multi-user preference learning algorithms.

Ferenc Huszár and Neil Houlsby.
**Adaptive Bayesian
quantum tomography**.
*Physical Review A*, 85(5):052120, 2012.

** Abstract:**
In this paper we revisit the problem of optimal design of quantum tomographic
experiments. In contrast to previous approaches where an optimal set of
measurements is decided in advance of the experiment, we allow for
measurements to be adaptively and efficiently re-optimised depending on data
collected so far. We develop an adaptive statistical framework based on
Bayesian inference and Shannon's information, and demonstrate a ten-fold
reduction in the total number of measurements required as compared to
non-adaptive methods, including mutually unbiased bases.

Hyun-Chul Kim and Zoubin Ghahramani.
**Bayesian
classifier combination**.
In *15th International Conference on Artificial Intelligence and
Statistics*, 2012.

** Abstract:** Bayesian model averaging
linearly mixes the probabilistic predictions of multiple models, each
weighted by its posterior probability. This is the coherent Bayesian way of
combining multiple models only under certain restrictive assumptions, which
we outline. We explore a general framework for Bayesian model combination
(which differs from model averaging) in the context of classification. This
framework explicitly models the relationship between each model’s output
and the unknown true label. The framework does not require that the models be
probabilistic (they can even be human assessors), that they share prior
information or receive the same training data, or that they be independent in
their errors. Finally, the Bayesian combiner does not need to believe any of
the models is in fact correct. We test several variants of this classifier
combination procedure starting from a classic statistical model proposed by
Dawid and Skene (1979) and using MCMC to add more complex but important
features to the model. Comparisons on sev- eral data sets to simpler methods
like majority voting show that the Bayesian methods not only perform well but
result in interpretable diagnostics on the data points and the models.

P. Kirk, J. E. Griffin, R. S. Savage, Z. Ghahramani, and D. L. Wild.
**Bayesian
correlated clustering to integrate multiple datasets**.
*Bioinformatics*, 2012.

** Abstract:** Motivation: The
integration of multiple datasets remains a key challenge in systems biology
and genomic medicine. Modern high-throughput technologies generate a broad
array of different data types, providing distinct – but often complementary
– information. We present a Bayesian method for the unsupervised
integrative modelling of multiple datasets, which we refer to as MDI
(Multiple Dataset Integration). MDI can integrate information from a wide
range of different datasets and data types simultaneously (including the
ability to model time series data explicitly using Gaussian processes). Each
dataset is modelled using a Dirichlet-multinomial allocation (DMA) mixture
model, with dependencies between these models captured via parameters that
describe the agreement among the datasets.

Results: Using a set of 6
artificially constructed time series datasets, we show that MDI is able to
integrate a significant number of datasets simultaneously, and that it
successfully captures the underlying structural similarity between the
datasets. We also analyse a variety of real S. cerevisiae datasets. In the
2-dataset case, we show that MDI’s performance is comparable to the present
state of the art. We then move beyond the capabilities of current approaches
and integrate gene expression, ChIP-chip and protein-protein interaction
data, to identify a set of protein complexes for which genes are co-regulated
during the cell cycle. Comparisons to other unsupervised data integration
techniques – as well as to non-integrative approaches – demonstrate that
MDI is very competitive, while also providing information that would be
difficult or impossible to extract using other methods.

** Comment:** This paper is available from the Bioinformatics site and a Matlab implementation
of MDI is available fromthis site.

James Robert Lloyd, Peter Orbanz, Zoubin Ghahramani, and Daniel M. Roy.
**Random
function priors for exchangeable arrays with applications to graphs and
relational data**.
In *Advances in Neural Information Processing Systems 26*, pages 1-9,
Lake Tahoe, California, USA, December 2012.

** Abstract:** A
fundamental problem in the analysis of structured relational data like
graphs, networks, databases, and matrices is to extract a summary of the
common structure underlying relations between individual entities. Relational
data are typically encoded in the form of arrays; invariance to the ordering
of rows and columns corresponds to exchangeable arrays. Results in
probability theory due to Aldous, Hoover and Kallenberg show that
exchangeable arrays can be represented in terms of a random measurable
function which constitutes the natural model parameter in a Bayesian model.
We obtain a flexible yet simple Bayesian nonparametric model by placing a
Gaussian process prior on the parameter function. Efficient inference
utilises elliptical slice sampling combined with a random sparse
approximation to the Gaussian process. We demonstrate applications of the
model to network data and clarify its relation to models in the literature,
several of which emerge as special cases.

David Lopez-Paz, José Miguel Hernández-Lobato, and Bernhard Scholköpf.
**Semi-supervised domain adaptation with non-parametric copulas**.
In *Advances in Neural Information Processing Systems 26*, pages 1-9,
Lake Tahoe, California, USA, December 2012.

** Abstract:** A
new framework based on the theory of copulas is proposed to address
semisupervised domain adaptation problems. The presented method factorizes
any multivariate density into a product of marginal distributions and
bivariate copula functions. Therefore, changes in each of these factors can
be detected and corrected to adapt a density model accross different learning
domains. Importantly, we introduce a novel vine copula model, which allows
for this factorization in a non-parametric manner. Experimental results on
regression problems with real-world data illustrate the efficacy of the
proposed approach when compared to state-of-the-art techniques.

Andrew McHutchon and Carl Edward Rasmussen.
**Gaussian
process training with input noise**.
In *Advances in Neural Information Processing Systems 24*, Granada,
Spain, December 2012.

** Abstract:** In standard Gaussian
Process regression input locations are assumed to be noise free. We present a
simple yet effective GP model for training on input points corrupted by
i.i.d. Gaussian noise. To make computations tractable we use a local linear
expansion about each input point. This allows the input noise to be recast as
output noise proportional to the squared gradient of the GP posterior mean.
The input noise hyperparameters are trained alongside other hyperparameters
by the usual method of maximisation of the marginal likelihood, and allow
estimation of the noise levels on each input dimension. Training uses an
iterative scheme, which alternates between optimising the hyperparameters and
calculating the posterior gradient. Analytic predictive moments can then be
found for Gaussian distributed test points. We compare our model to others
over a range of different regression problems and show that it improves over
current methods.

Shakir Mohamed, Katherine A. Heller, and Zoubin Ghahramani.
**Bayesian and
L _{1} approaches for sparse unsupervised learning**.
In

** Abstract:** The use of L1 regularisation for sparse learning
has generated immense research interest, with many successful applications in
diverse areas such as signal acquisition, image coding, genomics and
collaborative filtering. While existing work highlights the many advantages
of L1 methods, in this paper we find that L1 regularisation often
dramatically under-performs in terms of predictive performance when compared
to other methods for inferring sparsity. We focus on unsupervised latent
variable models, and develop L1 minimising factor models, Bayesian variants
of "L1", and Bayesian models with a stronger L0-like sparsity induced through
spike-and-slab distributions. These spike-and-slab Bayesian factor models
encourage sparsity while accounting for uncertainty in a principled manner,
and avoid unnecessary shrinkage of non-zero values. We demonstrate on a
number of data sets that in practice spike-and-slab Bayesian methods
out-perform L1 minimisation, even on a com- putational budget. We thus
highlight the need to re-assess the wide use of L1 methods in
sparsity-reliant applications, particularly when we care about generalising
to previously unseen data, and provide an alternative that, over many varying
conditions, provides improved generalisation performance.

Donglin Niu, Jennifer G. Dy, and Z. Ghahramani.
**A nonparametric
Bayesian model for multiple clustering with overlapping feature
views**.
In *15th International Conference on Artificial Intelligence and
Statistics*, 2012.

** Abstract:** Most clustering
algorithms produce a single clustering solution. This is inadequate for many
data sets that are multi-faceted and can be grouped and interpreted in many
different ways. Moreover, for high-dimensional data, different features may
be relevant or irrelevant to each clustering solution, suggesting the need
for feature selection in clustering. Features relevant to one clustering
interpretation may be different from the ones relevant for an alternative
interpretation or view of the data. In this paper, we introduce a
probabilistic nonparametric Bayesian model that can discover multiple
clustering solutions from data and the feature subsets that are relevant for
the clusters in each view. In our model, the features in different views may
be shared and therefore the sets of relevant features are allowed to overlap.
We model feature relevance to each view using an Indian Buffet Process and
the cluster membership in each view using a Chinese Restaurant Process. We
provide an inference approach to learn the latent parameters corresponding to
this multiple partitioning problem. Our model not only learns the features
and clusters in each view but also automatically learns the number of
clusters, number of views and number of features in each view.

Konstantina Palla, David A. Knowles, and Zoubin Ghahramani.
**An infinite latent attribute
model for network data**.
In *29th International Conference on Machine Learning*, Edinburgh,
Scotland, June 2012.

** Abstract:** Latent variable models for
network data extract a summary of the relational structure underlying an
observed network. The simplest possible models subdivide nodes of the network
into clusters; the probability of a link between any two nodes then depends
only on their cluster assignment. Currently available models can be
classified by whether clusters are disjoint or are allowed to overlap. These
models can explain a "flat" clustering structure. Hierarchical Bayesian
models provide a natural approach to capture more complex dependencies. We
propose a model in which objects are characterised by a latent feature
vector. Each feature is itself partitioned into disjoint groups
(subclusters), corresponding to a second layer of hierarchy. In experimental
comparisons, the model achieves significantly improved predictive performance
on social and biological link prediction tasks. The results indicate that
models with a single layer hierarchy over-simplify real networks.

Konstantina Palla, David A. Knowles, and Zoubin Ghahramani.
**A
nonparametric variable clustering model**.
In *Advances in Neural Information Processing Systems 26*, pages 1-9,
Lake Tahoe, California, USA, December 2012.

** Abstract:**
Factor analysis models effectively summarise the covariance structure of high
dimensional data, but the solutions are typically hard to interpret. This
motivates attempting to find a disjoint partition, i.e. a simple clustering,
of observed variables into highly correlated subsets. We introduce a Bayesian
non-parametric approach to this problem, and demonstrate advantages over
heuristic methods proposed to date. Our Dirichlet process variable clustering
(DPVC) model can discover block-diagonal covariance structures in data. We
evaluate our method on both synthetic and gene expression analysis
problems.

Barnabas Poczos, Zoubin Ghahramani, and Jeff Schneider.
**Copula-based
kernel dependency measures**.
In *29th International Conference on Machine Learning*, 2012.

** Abstract:** The paper presents a new copula based method for
measuring dependence between random variables. Our approach extends the
Maximum Mean Discrepancy to the copula of the joint distribution. We prove
that this approach has several advantageous properties. Similarly to Shannon
mutual information, the proposed dependence measure is invariant to any
strictly increasing transformation of the marginal variables. This is
important in many applications, for example in feature selection. The
estimator is consistent, robust to outliers, and uses rank statistics only.
We derive upper bounds on the convergence rate and propose independence tests
too. We illustrate the theoretical contributions through a series of
experiments in feature selection and low-dimensional embedding of
distributions.

Novi Quadrianto, Chao Chen, and Christoph Lampert.
**The most
persistent soft-clique in a set of sampled graphs**.
In *29th International Conference on Machine Learning*, Edinburgh,
Scotland, June 2012.

** Abstract:** When searching for
characteristic subpatterns in potentially noisy graph data, it appears
self-evident that having multiple observations would be better than having
just one. However, it turns out that the inconsistencies introduced when
different graph instances have different edge sets pose a serious challenge.
In this work we address this challenge for the problem of finding maximum
weighted cliques. We introduce the concept of most persistent soft-clique.
This is subset of vertices, that 1) is almost fully or at least densely
connected, 2) occurs in all or almost all graph instances, and 3) has the
maximum weight. We present a measure of clique-ness, that essentially counts
the number of edge missing to make a subset of vertices into a clique. With
this measure, we show that the problem of finding the most persistent
soft-clique problem can be cast either as: a) a max-min two person game
optimization problem, or b) a min-min soft margin optimization problem. Both
formulations lead to the same solution when using a partial Lagrangian method
to solve the optimization problems. By experiments on synthetic data and on
real social network data we show that the proposed method is able to reliably
find soft cliques in graph data, even if that is distorted by random noise or
unreliable observations.

Viktoriia Sharmanska, Novi Quadrianto, and Christoph Lampert.
**Augmented
attributes representations**.
In *12th European Conference on Computer Vision*, pages 242-255,
2012.

** Abstract:** We propose a new learning method to infer
a mid-level feature representation that combines the advantage of semantic
attribute representations with the higher expressive power of non-semantic
features. The idea lies in augmenting an existing attribute-based
representation with additional dimensions for which an autoencoder model is
coupled with a large-margin principle. This construction allows a smooth
transition between the zero-shot regime with no training example, the
unsupervised regime with training examples but without class labels, and the
supervised regime with training examples and with class labels. The resulting
optimization problem can be solved efficiently, because several of the
necessity steps have closed-form solutions. Through extensive experiments we
show that the augmented representation achieves better results in terms of
object categorization accuracy than the semantic representation alone.

Kyung-Ah Sohn, Zoubin Ghahramani, and Eric P. Xing.
**Robust estimation
of local genetic ancestry in admixed populations using a non-parametric
Bayesian approach**.
*Genetics*, 191(4), 2012.

** Abstract:** We present a new
haplotype-based approach for inferring local genetic ancestry of individuals
in an admixed population. Most existing approaches for local ancestry
estimation ignore the latent genetic relatedness between ancestral
populations and treat them as independent. In this paper, we exploit such
information by building an inheritance model that describes both the
ancestral populations and the admixed population jointly in a unified
framework. Based on an assumption that the common hypothetical founder
haplotypes give rise to both the ancestral and admixed population haplotypes,
we employ an infinite hidden Markov model to characterize each ancestral
population and further extend it to generate the admixed population. Through
an effective utilization of the population structural information under a
principled nonparametric Bayesian framework, the resulting model is
significantly less sensitive to the choice and the amount of training data
for ancestral populations than state-of-the-arts algorithms. We also improve
the robustness under deviation from common modeling assumptions by
incorporating population-specific scale parameters that allow variable
recombination rates in different populations. Our method is applicable to an
admixed population from an arbitrary number of ancestral populations and also
performs competitively in terms of spurious ancestry proportions under
general multi-way admixture assumption. We validate the proposed method by
simulation under various admixing scenarios and present empirical analysis
results on worldwide distributed dataset from Human Genome Diversity
Project.

** Comment:** doi: 10.1534/genetics.112.140228

Jacob Steinhardt and Zoubin Ghahramani.
**Flexible martingale
priors for deep hierarchies**.
In *15th International Conference on Artificial Intelligence and
Statistics*, 2012.

** Abstract:** When building priors
over trees for Bayesian hierarchical models, there is a tension between
maintaining desirable theoretical properties such as infinite exchangeability
and important practical properties such as the ability to increase the depth
of the tree to accommodate new data. We resolve this tension by presenting a
family of infinitely exchangeable priors over discrete tree structures that
allows the depth of the tree to grow with the data, and then showing that our
family contains all hierarchical models with certain mild symmetry
properties. We also show that deep hierarchical models are in general
intimately tied to a process called a martingale, and use Doob’s martingale
convergence theorem to demonstrate some unexpected properties of deep
hierarchies.

Tatiana Tommasi, Novi Quadrianto, Barbara Caputo, and Christoph Lampert.
**Beyond dataset
bias: Multi-task unaligned shared knowledge transfer**.
In *11th Asian Conference on Computer Vision*, 2012.

**
Abstract:** Many visual datasets are traditionally used to analyze the
performance of different learning techniques. The evaluation is usually done
within each dataset, therefore it is questionable if such results are a
reliable indicator of true generalization ability. We propose here an
algorithm to exploit the existing data resources when learning on a new
multiclass problem. Our main idea is to identify an image representation that
decomposes orthogonally into two subspaces: a part specific to each dataset,
and a part generic to, and therefore shared between, all the considered
source sets. This allows us to use the generic representation as un-biased
reference knowledge for a novel classification task. By casting the method in
the multi-view setting, we also make it possible to use different features
for different databases. We call the algorithm MUST, Multitask Unaligned
Shared knowledge Transfer. Through extensive experiments on five public
datasets, we show that MUST consistently improves the cross-datasets
generalization performance.

Ryan D. Turner and Carl Edward Rasmussen.
**Model based learning
of sigma points in unscented Kalman filtering**.
*Neurocomputing*, 80:47-53, 2012, doi
10.1016/j.neucom.2011.07.029.

** Abstract:** The unscented
Kalman filter (UKF) is a widely used method in control and time series
applications. The UKF suffers from arbitrary parameters necessary for sigma
point placement, potentially causing it to perform poorly in nonlinear
problems. We show how to treat sigma point placement in a UKF as a learning
problem in a model based view. We demonstrate that learning to place the
sigma points correctly from data can make sigma point collapse much less
likely. Learning can result in a significant increase in predictive
performance over default settings of the parameters in the UKF and other
filters designed to avoid the problems of the UKF, such as the GP-ADF. At the
same time, we maintain a lower computational complexity than the other
methods. We call our method UKF-L.

Richard E. Turner and Maneesh Sahani.
**Decomposing signals into a sum of amplitude and frequency
modulated sinusoids using probabilistic inference**.
In *Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE
International Conference on*, pages 2173-2176, march 2012, doi
10.1109/ICASSP.2012.6288343.

** Abstract:** There are many
methods for decomposing signals into a sum of amplitude and frequency
modulated sinusoids. In this paper we take a new estimation based approach.
Identifying the problem as ill-posed, we show how to regularize the solution
by imposing soft constraints on the amplitude and phase variables of the
sinusoids. Estimation proceeds using a version of Kalman smoothing. We
evaluate the method on synthetic and natural, clean and noisy signals,
showing that it outperforms previous decompositions, but at a higher
computational cost.

Andrew Gordon Wilson, David A. Knowles, and Zoubin Ghahramani.
**Gaussian process regression
networks**.
In *29th International Conference on Machine Learning*, Edinburgh,
Scotland, June 2012.

** Abstract:** We introduce a new
regression framework, Gaussian process regression networks (GPRN), which
combines the structural properties of Bayesian neural networks with the
nonparametric flexibility of Gaussian processes. GPRN accommodates input
(predictor) dependent signal and noise correlations between multiple output
(response) variables, input dependent length-scales and amplitudes, and
heavy-tailed predictive distributions. We derive both elliptical slice
sampling and variational Bayes inference procedures for GPRN. We apply GPRN
as a multiple output regression and multivariate volatility model,
demonstrating substantially improved performance over eight popular multiple
output (multi-task) Gaussian process models and three multivariate volatility
models on real datasets, including a 1000 dimensional gene expression
dataset.

Joshua Abbott, Katherine A. Heller, Zoubin Ghahramani, and Thomas L. Griffiths.
**Testing a
Bayesian measure of representativeness using a large image
database**.
In *Advances in Neural Information Processing Systems 24*, Cambridge,
MA, USA, 2011. The MIT Press.

** Abstract:** How do people
determine which elements of a set are most representative of that set? We
extend an existing Bayesian measure of representativeness, which indicates
the representativeness of a sample from a distribution, to deﬁne a measure
of the representativeness of an item to a set. We show that this measure is
formally related to a machine learning method known as Bayesian Sets.
Building on this connection, we derive an analytic expression for the
representativeness of objects described by a sparse vector of binary
features. We then apply this measure to a large database of images, using it
to determine which images are the most representative members of different
sets. Comparing the resulting predictions to human judgments of
representativeness provides a test of this measure with naturalistic stimuli,
and illustrates how databases that are more commonly used in computer vision
and machine learning can be used to evaluate psychological theories.

Daniel A. Braun, Pedro A. Ortega, Evangelos Theodorou, and Stefan Schaal.
**Path integral
control and bounded rationality**.
In *2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement
Learning*, 2011.

** Abstract:** Path integral methods have
recently been shown to be applicable to a very general class of optimal
control problems. Here we examine the path integral formalism from a
decision-theoretic point of view, since an optimal controller can always be
regarded as an instance of a perfectly rational decision-maker that chooses
its actions so as to maximize its expected utility. The problem with perfect
rationality is, however, that finding optimal actions is often very difficult
due to prohibitive computational resource costs that are not taken into
account. In contrast, a bounded rational decision-maker has only limited
resources and therefore needs to strike some compromise between the desired
utility and the required resource costs. In particular, we suggest an
information-theoretic measure of resource costs that can be derived
axiomatically. As a consequence we obtain a variational principle for choice
probabilities that trades off maximizing a given utility criterion and
avoiding resource costs that arise due to deviating from initially given
default choice probabilities. The resulting bounded rational policies are in
general probabilistic. We show that the solutions found by the path integral
formalism are such bounded rational policies. Furthermore, we show that the
same formalism generalizes to discrete control problems, leading to linearly
solvable bounded rational control policies in the case of Markov systems.
Importantly, Bellman's optimality principle is not presupposed by this
variational principle, but it can be derived as a limit case. This suggests
that the information theoretic formalization of bounded rationality might
serve as a general principle in control design that unifies a number of
recently reported approximate optimal control methods both in the continuous
and discrete domain.

Daniel A. Braun, Pedro A. Ortega, and Daniel M. Wolpert.
**Motor coordination:
When two have to act as one**.
*Special issue of Experimental Brain Research on Joint Action*, 2011.

** Abstract:** Trying to pass someone walking toward you in a
narrow corridor is a familiar example of a two-person motor game that
requires coordination. In this study, we investigate coordination in
sensorimotor tasks that correspond to classic coordination games with
multiple Nash equilibria, such as "choosing sides", "stag hunt", "chicken",
and "battle of sexes". In these tasks, subjects made reaching movements
reflecting their continuously evolving "decisions" while they received a
continuous payoff in the form of a resistive force counteracting their
movements. Successful coordination required two subjects to "choose" the same
Nash equilibrium in this force-payoff landscape within a single reach. We
found that on the majority of trials coordination was achieved. Compared to
the proportion of trials in which miscoordination occurred, successful
coordination was characterized by several distinct features: an increased
mutual information between the players' movement endpoints, an increased
joint entropy during the movements, and by differences in the timing of the
players' responses. Moreover, we found that the probability of successful
coordination depends on the players' initial distance from the Nash
equilibria. Our results suggest that two-person coordination arises naturally
in motor interactions and is facilitated by favorable initial positions,
stereotypical motor pattern, and differences in response times.

J. P. Cunningham, P. Nuyujukian, V. Gilja, C. A. Chestek, S. I. Ryu, and K. V.
Shenoy.
**A
closed-loop human simulator for investigating the role of feedback-control in
brain-machine interfaces**.
*Journal of Neurophysiology*, 105:1932-1949, 2011.

**
Abstract:** Neural prosthetic systems seek to improve the lives of severely
disabled people by decoding neural activity into useful behavioral commands.
These systems and their decoding algorithms are typically developed
"offline", using neural activity previously gathered from a healthy animal,
and the decoded movement is then compared with the true movement that
accompanied the recorded neural activity. However, this offline design and
testing may neglect important features of a real prosthesis, most notably the
critical role of feedback control, which enables the user to adjust neural
activity while using the prosthesis. We hypothesize that under- standing and
optimally designing high-performance decoders require an experimental
platform where humans are in closed-loop with the various candidate decode
systems and algorithms. It remains unexplored the extent to which the subject
can, for a particular decode system, algorithm, or parameter, engage feedback
and other strategies to improve decode performance. Closed-loop testing may
suggest different choices than offline analyses. Here we ask if a healthy
human subject, using a closed-loop neural prosthesis driven by synthetic
neural activity, can inform system design. We use this online pros- thesis
simulator (OPS) to optimize "online" decode performance based on a key
parameter of a current state-of-the-art decode algorithm, the bin width of a
Kalman filter. First, we show that offline and online analyses indeed suggest
different parameter choices. Previous literature and our offline analyses
agree that neural activity should be analyzed in bins of 100- to 300-ms
width. OPS analysis, which incorporates feedback control, suggests that much
shorter bin widths (25-50 ms) yield higher decode performance. Second, we
confirm this surprising finding using a closed-loop rhesus monkey prosthetic
system. These findings illustrate the type of discovery made possible by the
OPS, and so we hypothesize that this novel testing approach will help in the
design of prosthetic systems that will translate well to human patients.

A. Davies and Z. Ghahramani.
**Language-independent
Bayesian sentiment mining of twitter**.
In *In The Fifth Workshop on Social Network Mining and Analysis
(SNA-KDD 2011)*, August 2011.

** Abstract:** This paper
outlines a new language-independent model for sentiment analysis of short,
social-network statuses. We demonstrate this on data from Twitter, modelling
happy vs sad sentiment, and show that in some circumstances this outperforms
similar Naive Bayes models by more than 10%. We also propose an extension to
allow the modelling of differ- ent sentiment distributions in different
geographic regions, while incorporating information from neighbouring
regions. We outline the considerations when creating a system analysing
Twitter data and present a scalable system of data acquisi- tion and
prediction that can monitor the sentiment of tweets in real time.

Marc Peter Deisenroth and Carl Edward Rasmussen.
**PILCO: A
model-based and data-efficient approach to policy search**.
In *28th International Conference on Machine Learning*, 2011.

** Abstract:** In this paper, we introduce PILCO, a practical,
data-efficient model-based policy search method. PILCO reduces model bias,
one of the key problems of model-based reinforcement learning, in a
principled way. By learning a probabilistic dynamics model and explicitly
incorporating model uncertainty into long-term planning, PILCO can cope with
very little data and facilitates learning from scratch in only a few trials.
Policy evaluation is performed in closed form using state-of-the-art
approximate inference. Furthermore, policy gradients are computed
analytically for policy improvement. We report unprecedented learning
efficiency on challenging and high-dimensional control tasks.

** Comment:** web
site

Marc Peter Deisenroth, Carl Edward Rasmussen, and Dieter Fox.
**Learning to
control a low-cost manipulator using data-efficient reinforcement
learning**.
In *9th International Conference on Robotics: Science & Systems*, Los
Angeles, CA, USA, June 2011.

** Abstract:** Over the last
years, there has been substantial progress in robust manipulation in
unstructured environments. The long-term goal of our work is to get away from
precise, but very expensive robotic systems and to develop affordable,
potentially imprecise, self-adaptive manipulator systems that can
interactively perform tasks such as playing with children. In this paper, we
demonstrate how a low-cost off-the-shelf robotic system can learn closed-loop
policies for a stacking task in only a handful of trials - from scratch. Our
manipulator is inaccurate and provides no pose feedback. For learning a
controller in the work space of a Kinect-style depth camera, we use a
model-based reinforcement learning technique. Our learning method is data
efficient, reduces model bias, and deals with several noise sources in a
principled way during long-term planning. We present a way of incorporating
state-space constraints into the learning process and analyze the learning
gain by exploiting the sequential structure of the stacking task.

** Comment:** project site

Finale Doshi-Velez and Zoubin Ghahramani.
**A comparison of
human and agent reinforcement learning in partially observable
domains**.
In *33rd Annual Meeting of the Cognitive Science Society*, Boston, MA,
2011.

** Abstract:** It is commonly stated that reinforcement
learning (RL) algorithms learn slower than humans. In this work, we
investigate this claim using two standard problems from the RL literature. We
compare the performance of human subjects to RL techniques. We find that
context-the meaningfulness of the observations—-plays a significant role
in the rate of human RL. Moreover, without contextual information, humans
often fare much worse than classic algorithms. Comparing the detailed
responses of humans and RL algorithms, we also find that humans appear to
employ rather different strategies from standard algorithms, even in cases
where they had indistinguishable performance to them. Our research both sheds
light on human RL and provides insights for improving RL algorithms.

Thomas L. Griffiths and Zoubin Ghahramani.
**The Indian buffet
process: An introduction and review**.
*Journal of Machine Learning Research*, 12:1185-1224, April 2011.

** Abstract:** The Indian buffet process is a stochastic process
defining a probability distribution over equivalence classes of sparse binary
matrices with a finite number of rows and an unbounded number of columns.
This distribution is suitable for use as a prior in probabilistic models that
represent objects using a potentially infinite array of features, or that
involve bipartite graphs in which the size of at least one class of nodes is
unknown. We give a detailed derivation of this distribution, and illustrate
its use as a prior in an infinite latent feature model. We then review recent
applications of the Indian buffet process in machine learning, discuss its
extensions, and summarize its connections to other stochastic processes.

Joseph Hall, Carl Edward Rasmussen, and Jan Maciejowski.
**Reinforcement
learning with reference tracking control in continuous state spaces**.
In *Proceedings of 50th IEEE Conference on Decision and Control and European
Control Conference*, 2011.

** Abstract:** The contribution
described in this paper is an algorithm for learning nonlinear, reference
tracking, control policies given no prior knowledge of the dynamical system
and limited interaction with the system through the learning process.
Concepts from the field of reinforcement learning, Bayesian statistics and
classical control have been brought together in the formulation of this
algorithm which can be viewed as a form indirect self tuning regulator. On
the task of reference tracking using the inverted pendulum it was shown to
yield generally improved performance on the best controller derived from the
standard linear quadratic method using only 30 s of total interaction with
the system. Finally, the algorithm was shown to work on the double pendulum
proving its ability to solve nontrivial control tasks.

Daniel Hernández-Lobato, José Miguel Hernández-Lobato, and Pierre Dupont.
**Robust
multi-class Gaussian process classification**.
In *Advances in Neural Information Processing Systems 25*, 2011.

** Abstract:** Multi-class Gaussian Processs Classifiers (MGPCs)
are often affected by overfitting problems when labeling errors occur far
from the decision boundaries. To prevent this, we investigate a robust MGPC
(RMGPC) which considers labeling errors independently of their distance to
the decision boundaries. Expectation propagation is used for approximate
inference. Experiments with several datasets in which noise is injected in
the labels illustrate the benefits of RMGPC. This method performs better than
other Gaussian process alternatives based on considering latent Gaussian
noise or heavy-tailed processes. When no noise is injected in the labels,
RMGPC still performs equal or better than the other methods. Finally, we show
how RMGPC can be used for successfully indentifying data instances which are
difficult to classify correctly in practice.

Ferenc Huszár and Simon Lacoste-Julien.
**A kernel approach to
tractable Bayesian nonparametrics**.
Technical report, University of Cambridge, 2011.

** Abstract:**
Inference in popular nonparametric Bayesian models typically relies on
sampling or other approximations. This paper presents a general methodology
for constructing novel tractable nonparametric Bayesian methods by applying
the kernel trick to inference in a parametric Bayesian model. For example,
Gaussian process regression can be derived this way from Bayesian linear
regression. Despite the success of the Gaussian process framework, the kernel
trick is rarely explicitly considered in the Bayesian literature. In this
paper, we aim to fill this gap and demonstrate the potential of applying the
kernel trick to tractable Bayesian parametric models in a wider context than
just regression. As an example, we present an intuitive Bayesian kernel
machine for density estimation that is obtained by applying the kernel trick
to a Gaussian generative model in feature space.

** Comment:** arXiv:1103.1761

David A. Knowles, Jurgen Van Gael, and Zoubin Ghahramani.
**Message passing
algorithms for the Dirichlet diffusion tree**.
In *28th International Conference on Machine Learning*, 2011.

** Abstract:** We demonstrate efficient approximate inference
for the Dirichlet Diffusion Tree (Neal, 2003), a Bayesian nonparametric prior
over tree structures. Although DDTs provide a powerful and elegant approach
for modeling hierarchies they haven't seen much use to date. One problem is
the computational cost of MCMC inference. We provide the first deterministic
approximate inference methods for DDT models and show excellent performance
compared to the MCMC alternative. We present message passing algorithms to
approximate the Bayesian model evidence for a specific tree. This is used to
drive sequential tree building and greedy search to find optimal tree
structures, corresponding to hierarchical clusterings of the data. We
demonstrate appropriate observation models for continuous and binary data.
The empirical performance of our method is very close to the computationally
expensive MCMC alternative on a density estimation problem, and significantly
outperforms kernel density estimators.

** Comment:** web site

David A. Knowles and Zoubin Ghahramani.
**Pitman-Yor
diffusion trees**.
In *27nd Conference on Uncertainty in Artificial Intelligence*, 2011.

** Abstract:** We introduce the Pitman Yor Diffusion Tree (PYDT)
for hierarchical clustering, a generalization of the Dirichlet Diffusion Tree
(Neal, 2001) which removes the restriction to binary branching structure. The
generative process is described and shown to result in an exchangeable
distribution over data points. We prove some theoretical properties of the
model and then present two inference methods: a collapsed MCMC sampler which
allows us to model uncertainty over tree structures, and a computationally
efficient greedy Bayesian EM search algorithm. Both algorithms use message
passing on the tree structure. The utility of the model and algorithms is
demonstrated on synthetic and real world data, both continuous and
binary.

** Comment:** web site

David A. Knowles and Zoubin Ghahramani.
**Nonparametric
Bayesian sparse factor models with application to gene expression
modelling.**.
*Annals of Applied Statistics*, 5(2B):1534-1552, 2011.

**
Abstract:** A nonparametric Bayesian extension of Factor Analysis (FA) is
proposed where observed data Y is modeled as a linear superposition, G, of a
potentially infinite number of hidden factors, X. The Indian Buffet Process
(IBP) is used as a prior on G to incorporate sparsity and to allow the number
of latent features to be inferred. The model's utility for modeling gene
expression data is investigated using randomly generated data sets based on a
known sparse connectivity matrix for E. Coli, and on three biological data
sets of increasing complexity.

David A. Knowles and Thomas P. Minka.
**Non-conjugate
variational message passing for multinomial and binary regression**.
In *Advances in Neural Information Processing Systems 25*, 2011.

** Abstract:** Variational Message Passing (VMP) is an
algorithmic implementation of the Variational Bayes (VB) method which applies
only in the special case of conjugate exponential family models. We propose
an extension to VMP, which we refer to as Non-conjugate Variational Message
Passing (NCVMP) which aims to alleviate this restriction while maintaining
modularity, allowing choice in how expectations are calculated, and
integrating into an existing message-passing framework: Infer.NET. We
demonstrate NCVMP on logistic binary and multinomial regression. In the
multinomial case we introduce a novel variational bound for the softmax
factor which is tighter than other commonly used bounds whilst maintaining
computational tractability.

** Comment:** web site supplementary

David A. Knowles, Leopold Parts, Daniel Glass, and John M. Winn.
**Inferring a
measure of physiological age from multiple ageing related phenotypes**.
In *NIPS Workshop: From Statistical Genetics to Predictive Models in
Personalized Medicine*, 2011.

** Abstract:** What is
ageing? One definition is simultaneous degradation of multiple organ systems.
Can an individual be said to be "old" or "young" for their (chronological)
age in a scientifically meaningful way? We investigate these questions using
ageing related phenotypes measured on the 12,000 female twins in the Twins UK
study. We propose a simple linear model of ageing, which allows a latent
adjustment to be made to an individual's chronological age to give her
"physiological age", shared across the observed phenotypes. We note problems
with the analysis resulting from the linearity assumption and show how to
alleviate these issues using a non-linear extension. We find more gene
expression probes are significantly associated with our measurement of
physiological age than to chronological age.

** Comment:** web site

Simon Lacoste-Julien, Ferenc Huszár, and Zoubin Ghahramani.
**Approximate
inference for the loss-calibrated Bayesian**.
In Geoff Gordon and David Dunson, editors, *14th International Conference on
Artificial Intelligence and Statistics*, volume 15, pages 416-424,
Fort Lauderdale, FL, USA, April 2011. Journal of Machine Learning Research.

** Abstract:** We consider the problem of approximate inference
in the context of Bayesian decision theory. Traditional approaches focus on
approximating general properties of the posterior, ignoring the decision task
- and associated losses - for which the posterior could be used. We argue
that this can be suboptimal and propose instead to *loss-calibrate* the
approximate inference methods with respect to the decision task at hand. We
present a general framework rooted in Bayesian decision theory to analyze
approximate inference from the perspective of losses, opening up several
research directions. As a first loss-calibrated approximate inference
attempt, we propose an EM-like algorithm on the Bayesian posterior risk and
show how it can improve a standard approach to Gaussian process
classification when losses are asymmetric.

J. H. Macke, L. Busing, J. P. Cunningham, B. M. Yu, K. V. Shenoy, and
M. Sahani.
**Empirical
models of spiking in neural populations**.
In *Advances in Neural Information Processing Systems 25*, pages 1-8,
Granada, Spain, December 2011.

** Abstract:** Neurons in the
neocortex code and compute as part of a locally interconnected population.
Large-scale multi-electrode recording makes it possible to access these
population processes empirically by fitting statistical models to unaveraged
data. What statistical structure best describes the concurrent spiking of
cells within a local network? We argue that in the cortex, where firing
exhibits extensive correlations in both time and space and where a typical
sample of neurons still reflects only a very small fraction of the local
population, the most appropriate model captures shared variability by a
low-dimensional latent process evolving with smooth dynamics, rather than by
putative direct coupling. We test this claim by comparing a latent dynamical
model with realistic spiking observations to coupled generalised linear
spike-response models (GLMs) using cortical recordings. We find that the
latent dynamical approach outperforms the GLM in terms of goodness-of- fit,
and reproduces the temporal correlations in the data more accurately. We also
compare models whose observations models are either derived from a Gaussian
or point-process models, finding that the non-Gaussian model provides
slightly better goodness-of-fit and more realistic population spike
counts.

Shakir Mohamed.
**Generalised Bayesian
matrix factorisation models**.
PhD thesis, University of Cambridge, Department of Engineering, Cambridge, UK,
2011.

** Abstract:** Factor analysis and related models for
probabilistic matrix factorisation are of central importance to the
unsupervised analysis of data, with a colourful history more than a century
long. Probabilistic models for matrix factorisation allow us to explore the
underlying structure in data, and have relevance in a vast number of
application areas including collaborative filtering, source separation,
missing data imputation, gene expression analysis, information retrieval,
computational finance and computer vision, amongst others.

This thesis
develops generalisations of matrix factorisation models that advance our
understanding and enhance the applicability of this important class of
models. The generalisation of models for matrix factorisation focuses on
three concerns: widening the applicability of latent variable models to the
diverse types of data that are currently available; considering alternative
structural forms in the underlying representations that are inferred; and
including higher order data structures into the matrix factorisation
framework. These three issues reflect the reality of modern data analysis and
we develop new models that allow for a principled exploration and use of data
in these settings. We place emphasis on Bayesian approaches to learning and
the advantages that come with the Bayesian methodology. Our port of departure
is a generalisation of latent variable models to members of the exponential
family of distributions. This generalisation allows for the analysis of data
that may be real-valued, binary, counts, non-negative or a heterogeneous set
of these data types. The model unifies various existing models and constructs
for unsupervised settings, the complementary framework to the generalised
linear models in regression.

Moving to structural considerations, we
develop Bayesian methods for learning sparse latent representations. We
define ideas of weakly and strongly sparse vectors and investigate the
classes of prior distributions that give rise to these forms of sparsity,
namely the scale-mixture of Gaussians and the spike-and-slab distribution.
Based on these sparsity favouring priors, we develop and compare methods for
sparse matrix factorisation and present the first comparison of these sparse
learning approaches. As a second structural consideration, we develop models
with the ability to generate correlated binary vectors. Moment-matching is
used to allow binary data with specified correlation to be generated, based
on dichotomisation of the Gaussian distribution. We then develop a novel and
simple method for binary PCA based on Gaussian dichotomisation. The third
generalisation considers the extension of matrix factorisation models to
multi-dimensional arrays of data that are increasingly prevalent. We develop
the first Bayesian model for non-negative tensor factorisation and explore
the relationship between this model and the previously described models for
matrix factorisation.

Mehregan Movassagh, Mun-Kit Choy, David A. Knowles, Lina Cordeddu, Syed Haider,
Thomas Down, Lee Siggens, Ana Vujic, Ilenia Simeoni, Chris Penkett, Martin
Goddard, Pietro Lio, Martin Bennett, and Roger Foo.
**Distinct epigenomic features in human
cardiomyopathy**.
*Circulation, American Heart Association*, 2011.

**
Abstract:** Background. The epigenome refers to marks on the genome
including DNA methylation and histone modifications that regulate the
expression of underlying genes. A consistent profile of gene expression
changes in end- stage cardiomyopathy led us to hypothesise that distinct
global patterns of the epigenome may also exist. Methods and Results. We
constructed genome-wide maps of DNA methylation and Histone-3 Lysine-36
tri-methylation (H3K36me3)-enrichment for cardiomyopathic and normal human
hearts. 506Mb of sequence per library was generated by high-throughput
sequencing, covering 24 million out of the 28 million CG di-nucleotides in
the human genome. DNA methylation was significantly different in promoter
CpG-islands (CGI), intra-genic CGI, gene bodies and H3K36me3-enriched regions
of the genome. Moreover DNA methylation differences were present in promoters
of upregulated genes but not down-regulated genes. The profile of
H3K36me3-enrichment itself was also significantly different in protein-coding
regions of the genome. Conclusions. Distinct epigenomic patterns exist in
important DNA elements of the human cardiac genome in end-stage
cardiomyopathy. If epigenomic patterns track with disease progression, assays
for the epigenome may be more useful than quantification of mRNA for
assessing prognosis in heart failure. These results open up an important new
horizon of research and further studies will be needed to determine how
epigenomics contribute to altered gene expression in cardiomyopathy.

Peter Orbanz.
**Projective limit
random probabilities on Polish spaces**.
*Electron. J. Stat.*, 5:1354-1373, 2011.

**
Abstract:** A pivotal problem in Bayesian nonparametrics is the
construction of prior distributions on the space M(V) of probability measures
on a given domain V. In principle, such distributions on the
infinite-dimensional space M(V) can be constructed from their
finite-dimensional marginals—the most prominent example being the
construction of the Dirichlet process from finite-dimensional Dirichlet
distributions. This approach is both intuitive and applicable to the
construction of arbitrary distributions on M(V), but also hamstrung by a
number of technical difficulties. We show how these difficulties can be
resolved if the domain V is a Polish topological space, and give a
representation theorem directly applicable to the construction of any
probability distribution on M(V) whose first moment measure is well-defined.
The proof draws on a projective limit theorem of Bochner, and on properties
of set functions on Polish spaces to establish countable additivity of the
resulting random probabilities.

Pedro A. Ortega.
**A Unified Framework for
Resource-Bounded Agents Interacting with an Unknown Environment**.
PhD thesis, Department of Engineering, University of Cambridge, 2011.

** Abstract:** The aim of this thesis is to present a
mathematical framework for conceptualizing and constructing adaptive
autonomous systems under resource constraints. The first part of this thesis
contains a concise presentation of the foundations of classical agency:
namely the formalizations of decision making and learning. Decision making
includes: (a) subjective expected utility (SEU) theory, the framework of
decision making under uncertainty; (b) the maximum SEU principle to choose
the optimal solution; and (c) its application to the design of autonomous
systems, culminating in the Bellman optimality equations. Learning includes:
(a) Bayesian probability theory, the theory for reasoning under uncertainty
that extends logic; and (b) Bayes-Optimal agents, the application of Bayesian
probability theory to the design of optimal adaptive agents. Then, two major
problems of the maximum SEU principle are highlighted: (a) the prohibitive
computational costs and (b) the need for the causal precedence of the choice
of the policy. The second part of this thesis tackles the two aforementioned
problems. First, an information-theoretic notion of resources in autonomous
systems is established. Second, a framework for resource-bounded agency is
introduced. This includes: (a) a maximum bounded SEU principle that is
derived from a set of axioms of utility; (b) an axiomatic model of
probabilistic causality, which is applied for the formalization of autonomous
systems having uncertainty over their policy and environment; and (c) the
Bayesian control rule, which is derived from the maximum bounded SEU
principle and the model of causality, implementing a stochastic adaptive
control law that deals with the case where autonomous agents are uncertain
about their policy and environment.

Pedro A. Ortega and Daniel A. Braun.
**Information, utility and bounded rationality**.
In *The fourth conference on artificial general intelligence*, volume
6830 of *Lecture Notes on Artificial Intelligence*, pages 269-274.
Springer-Verlag, 2011.

** Abstract:** Perfectly rational
decision-makers maximize expected utility, but crucially ignore the resource
costs incurred when determining optimal actions. Here we employ an axiomatic
framework for bounded rational decision-making based on a thermodynamic
interpretation of resource costs as information costs. This leads to a
variational free utility principle akin to thermodynamical free energy that
trades off utility and information costs. We show that bounded optimal
control solutions can be derived from this variational principle, which leads
in general to stochastic policies. Furthermore, we show that risk-sensitive
and robust (minimax) control schemes fall out naturally from this framework
if the environment is considered as a bounded rational and perfectly rational
opponent, respectively. When resource costs are ignored, the maximum expected
utility principle is recovered.

Pedro A. Ortega, Daniel A. Braun, and Simon Godsill.
**Reinforcement learning
and the Bayesian control rule**.
In *The fourth conference on artificial general intelligence*, volume
6830 of *Lecture Notes on Artificial Intelligence*, pages 281-285.
Springer-Verlag, 2011.

** Abstract:** We present an
actor-critic scheme for reinforcement learning in complex domains. The main
contribution is to show that planning and I/O dynamics can be separated such
that an intractable planning problem reduces to a simple multi-armed bandit
problem, where each lever stands for a potentially arbitrarily complex
policy. Furthermore, we use the Bayesian control rule to construct an
adaptive bandit player that is universal with respect to a given class of
optimal bandit players, thus indirectly constructing an adaptive agent that
is universal with respect to a given class of policies.

B. Petreska, B. M. Yu, J. P. Cunningham, G. Santhanam, S. I. Ryu, K. V. Shenoy,
and M. Sahani.
**Dynamical
segmentation of single trials from population neural data**.
In *Advances in Neural Information Processing Systems 25*, pages 1-8,
Granada, Spain, December 2011.

** Abstract:** Simultaneous
recordings of many neurons embedded within a recurrently-connected cortical
network may provide concurrent views into the dynamical processes of that
network, and thus its computational function. In principle, these dynamics
might be identified by purely unsupervised, statistical means. Here, we show
that a Hidden Switching Linear Dynamical Systems (HSLDS) model - in which
multiple linear dynamical laws approximate and nonlinear and potentially
non-stationary dynamical process - is able to distinguish dynamical regimes
within single-trial motor cortical activity associated with the preparation
and initiation of hand movements. The regimes are identified without
reference to behavioural or experimental epochs, but nonetheless transitions
between them correlate strongly with external events whose timing may vary
from trial to trial. The HSLDS model also performs better than recent
comparable models in predicting the firing rate of an isolated neuron based
on the firing rates of others, suggesting that it captures more of the
"Shared variance" of the data. Thus, the method is able to trace the
dynamical processes underlying the coordinated evolution of network activity
in a way that appears to reflect its computational role.

Daniel M. Roy.
**On the computability
and complexity of Bayesian reasoning**.
In *NIPS Workshop on Philosophy and Machine Learning*, 2011.

** Abstract:** If we consider the claim made by some cognitive
scientists that the mind performs Bayesian reasoning, and if we
simultaneously accept the Physical Church-Turing thesis and thus believe that
the computational power of the mind is no more than that of a Turing machine,
then what limitations are there to the reasoning abilities of the mind? I
give an overview of joint work with Nathanael Ackerman (Harvard, Mathematics)
and Cameron Freer (MIT, CSAIL) that bears on the computability and complexity
of Bayesian reasoning. In particular, we prove that conditional probability
is in general not computable in the presence of continuous random variables.
However, in light of additional structure in the prior distribution, such as
the presence of certain types of noise, or of exchangeability, conditioning
is possible. These results cover most of statistical practice. At the
workshop on Logic and Computational Complexity, we presented results on the
computational complexity of conditioning, embedding sharp-P-complete problems
in the task of computing conditional probabilities for diffuse continuous
random variables. This work complements older work. For example, under
cryptographic assumptions, the computational complexity of producing samples
and computing probabilities was separated by Ben-David, Chor, Goldreich and
Luby. In recent work, we also make use of cryptographic assumptions to show
that different representations of exchangeable sequences may have vastly
different complexity. However, when faced with an adversary that is
computational bounded, these different representations have the same
complexity, highlighting the fact that knowledge representation and
approximation play a fundamental role in the possibility and plausibility of
Bayesian reasoning.

Yunus Saatçi.
**Scalable inference for
structured Gaussian process models**.
PhD thesis, University of Cambridge, Department of Engineering, Cambridge, UK,
2011.

** Abstract:** The generic inference and learning
algorithm for Gaussian Process (GP) regression has O(N^{3}) runtime
and O(N^{2}) memory complexity, where N is the number of observations
in the dataset. Given the computational resources available to a present-day
workstation, this implies that GP regression simply *cannot be run* on
large datasets. The need to use non- Gaussian likelihood functions for tasks
such as classification adds even more to the computational burden
involved.

The majority of algorithms designed to improve the scaling of
GPs are founded on the idea of approximating the true covariance matrix,
which is usually of rank N, with a matrix of rank P, where P<<N.
Typically, the true training set is replaced with a smaller, representative
(pseudo-) training set such that a specific measure of information loss is
minimized. These algorithms typically attain O(P^{2}N) runtime and
O(PN) space complexity. They are also general in the sense that they are
designed to work with *any* covariance function. In essence, they
trade off accuracy with computational complexity. The central contribution of
this thesis is to improve scaling instead by exploiting any structure that is
present in the covariance matrices generated by *particular*
covariance functions. Instead of settling for a kernel-independent
accuracy/complexity trade off, as is done in much the literature, we often
obtain accuracies close to, or exactly equal to the full GP model at a
fraction of the computational cost.

We define a *structured* GP as
any GP model that is endowed with a kernel which produces structured
covariance matrices. A trivial example of a structured GP is one with the
linear regression kernel. In this case, given inputs living in R^{D},
the covariance matrices generated have rank D - this results in significant
computational gains in the usual case where D<<N. Another case arises
when a stationary kernel is evaluated on equispaced, scalar inputs. This
results in *Toeplitz* covariance matrices and all necessary
computations can be carried out exactly in O(N log N).

This thesis
studies four more types of structured GP. First, we comprehensively review
the case of kernels corresponding to *Gauss-Markov* processes
evaluated on scalar inputs. Using state-space models we show how
(generalised) regression (including hyperparameter learning) can be performed
in O(N log N) runtime and O(N) space. Secondly, we study the case where we
introduce block structure into the covariance matrix of a GP time-series
model by assuming a particular form of nonstationarity a priori. Third, we
extend the efficiency of scalar Gauss-Markov processes to higher-dimensional
input spaces by assuming *additivity*. We illustrate the connections
between the classical backfitting algorithm and approximate Bayesian
inference techniques including Gibbs sampling and variational Bayes. We also
show that it is possible to relax the rather strong assumption of additivity
without sacrificing O(N log N) complexity, by means of a projection-pursuit
style GP regression model. Finally, we study the properties of a GP model
with a tensor product kernel evaluated on a multivariate grid of inputs
locations. We show that for an *arbitrary* (regular or irregular) grid
the resulting covariance matrices are Kronecker and full GP regression can be
implemented in O(N) time and memory usage.

We illustrate the power of
these methods on several real-world regression datasets which satisfy the
assumptions inherent in the structured GP employed. In many cases we obtain
performance comparable to the generic GP algorithm. We also analyse the
performance degradation when these assumptions are not met, and in several
cases show that it is comparable to that observed for sparse GP methods. We
provide similar results for regression tasks with non-Gaussian likelihoods,
an extension rarely addressed by sparse GP techniques.

Cornelia Schone, Anne Venner, David A. Knowles, Mahesh M Karnani, and Denis
Burdakov.
**Dichotomous cellular properties of mouse orexin/hypocretin
neurons**.
*The Journal of Physiology*, 2011.

** Abstract:**
Hypothalamic hypocretin/orexin (hcrt/orx) neurons recently emerged as
critical regulators of sleep-wake cycles, reward-seeking, and body energy
balance. However, at the level of cellular and network properties, it remains
unclear whether hcrt/orx neurons are one homogenous population, or whether
there are several distinct types of hcrt/orx cells. Here, we collated diverse
structural and functional information about individual hcrt/orx neurons in
mouse brain slices, by combining patch-clamp analysis of spike firing,
membrane currents, and synaptic inputs with confocal imaging of cell shape
and subsequent 3-dimensional Sholl analysis of dendritic architecture.
Statistical cluster analysis of intrinsic firing properties revealed that
hcrt/orx neurons fall into two distinct types. These two cell types also
differ in the complexity of their dendritic arbour, the strength of AMPA and
GABAA receptor-mediated synaptic drive that they receive, and the density of
low-threshold, 4-aminopyridine-sensitive, transient K+ current. Our results
provide quantitative evidence that, at the cellular level, the mouse hcrt/orx
system is composed of two classes of neurons with different firing
properties, morphologies, and synaptic input organization.

David Sontag and Daniel M. Roy.
**The Complexity of
Inference in Latent Dirichlet Allocation**.
In *Advances in Neural Information Processing Systems 24*, Cambridge,
MA, USA, 2011. The MIT Press.

** Abstract:** We consider the
computational complexity of probabilistic inference in Latent Dirichlet
Allocation (LDA). First, we study the problem of finding the maximum a
posteriori (MAP) assignment of topics to words, where the document's topic
distribution is integrated out. We show that, when the effective number of
topics per document is small, exact inference takes polynomial time. In
contrast, we show that, when a document has a large number of topics, finding
the MAP assignment of topics to words in LDA is NP-hard. Next, we consider
the problem of finding the MAP topic distribution for a document, where the
topic-word assignments are integrated out. We show that this problem is also
NP-hard. Finally, we briefly discuss the problem of sampling from the
posterior, showing that this is NP-hard in one restricted setting, but
leaving open the general question.

Ryan Darby Turner.
**Gaussian processes for
state space models and change point detection**.
PhD thesis, University of Cambridge, Department of Engineering, Cambridge, UK,
2011.

** Abstract:** This thesis details several applications
of Gaussian processes (GPs) for enhanced time series modeling. We first cover
different approaches for using Gaussian processes in time series problems.
These are extended to the state space approach to time series in two
different problems. We also combine Gaussian processes and Bayesian online
change point detection (BOCPD) to increase the generality of the Gaussian
process time series methods. These methodologies are evaluated on predictive
performance on six real world data sets, which include three environmental
data sets, one financial, one biological, and one from industrial well
drilling.

Gaussian processes are capable of generalizing standard linear
time series models. We cover two approaches: the Gaussian process time series
model (GPTS) and the autoregressive Gaussian process (ARGP). We cover a
variety of methods that greatly reduce the computational and memory
complexity of Gaussian process approaches, which are generally cubic in
computational complexity.

Two different improvements to state space based
approaches are covered. First, Gaussian process inference and learning (GPIL)
generalizes linear dynamical systems (LDS), for which the Kalman filter is
based, to general nonlinear systems for nonparametric system identification.
Second, we address pathologies in the unscented Kalman filter (UKF). We use
Gaussian process optimization (GPO) to learn UKF settings that minimize the
potential for sigma point collapse.

We show how to embed mentioned
Gaussian process approaches to time series into a change point framework. Old
data, from an old regime, that hinders predictive performance is
automatically and elegantly phased out. The computational improvements for
Gaussian process time series approaches are of even greater use in the change
point framework. We also present a supervised framework learning a change
point model when change point labels are available in training.

These
mentioned methodologies significantly improve predictive performance on the
diverse set of data sets selected.

Richard E. Turner and Maneesh Sahani.
**Two problems with variational expectation maximisation for
time-series models**.
In D. Barber, T. Cemgil, and S. Chiappa, editors, *Bayesian Time series
models*, chapter 5, pages 109-130. Cambridge University Press,
2011.

** Abstract:** Variational methods are a key component
of the approximate inference and learning toolbox. These methods fill an
important middle ground, retaining distributional information about
uncertainty in latent variables, unlike maximum a posteriori methods (MAP),
and yet generally requiring less computational time than Monte Carlo Markov
Chain methods. In particular the variational Expectation Maximisation (vEM)
and variational Bayes algorithms, both involving variational optimisation of
a free-energy, are widely used in time-series modelling. Here, we investigate
the success of vEM in simple probabilistic time-series models. First we
consider the inference step of vEM, and show that a consequence of the
well-known compactness property of variational inference is a failure to
propagate uncertainty in time, thus limiting the usefulness of the retained
distributional information. In particular, the uncertainty may appear to be
smallest precisely when the approximation is poorest. Second, we consider
parameter learning and analytically reveal systematic biases in the
parameters found by vEM. Surprisingly, simpler variational approximations
(such a mean-field) can lead to less bias than more complicated structured
approximations.

Richard E. Turner and Maneesh Sahani.
**Demodulation as probabilistic inference**.
*Transactions on Audio, Speech and Language Processing*, 19:2398-2411,
2011.

** Abstract:** Demodulation is an ill-posed problem
whenever both carrier and envelope signals are broadband and unknown. Here,
we approach this problem using the methods of probabilistic inference. The
new approach, called Probabilistic Amplitude Demodulation (PAD), is
computationally challenging but improves on existing methods in a number of
ways. By contrast to previous approaches to demodulation, it satisfies five
key desiderata: PAD has soft constraints because it is probabilistic; PAD is
able to automatically adjust to the signal because it learns parameters; PAD
is user-steerable because the solution can be shaped by user-specific prior
information; PAD is robust to broad-band noise because this is modelled
explicitly; and PAD’s solution is self-consistent, empirically satisfying a
Carrier Identity property. Furthermore, the probabilistic view naturally
encompasses noise and uncertainty, allowing PAD to cope with missing data and
return error bars on carrier and envelope estimates. Finally, we show that
when PAD is applied to a bandpass-filtered signal, the stop-band energy of
the inferred carrier is minimal, making PAD well-suited to sub-band
demodulation.

Richard E. Turner and Maneesh Sahani.
**Probabilistic amplitude and frequency demodulation**.
In *Advances in Neural Information Processing Systems 24*, pages
981-989. The MIT Press, 2011.

** Abstract:** A number of
recent scientific and engineering problems require signals to be decomposed
into a product of a slowly varying positive envelope and a quickly varying
carrier whose instantaneous frequency also varies slowly over time. Although
signal processing provides algorithms for so-called amplitude- and
frequency-demodulation (AFD), there are well known problems with all of the
existing methods. Motivated by the fact that AFD is ill-posed, we approach
the problem using probabilistic inference. The new approach, called
probabilistic amplitude and frequency demodulation (PAFD), models
instantaneous frequency using an auto-regressive generalization of the von
Mises distribution, and the envelopes using Gaussian auto-regressive dynamics
with a positivity constraint. A novel form of expectation propagation is used
for inference. We demonstrate that although PAFD is computationally
demanding, it outperforms previous approaches on synthetic and real signals
in clean, noisy and missing data settings.

Andrew Gordon Wilson and Zoubin Ghahramani.
**Generalised
Wishart processes**.
In *27nd Conference on Uncertainty in Artificial Intelligence*, 2011.

** Abstract:** We introduce a new stochastic process called the
generalised Wishart process (GWP). It is a collection of positive
semi-definite random matrices indexed by any arbitrary input variable. We use
this process as a prior over dynamic (e.g. time varying) covariance matrices.
The GWP captures a diverse class of covariance dynamics, naturally hanles
missing data, scales nicely with dimension, has easily interpretable
parameters, and can use input variables that include covariates other than
time. We describe how to construct the GWP, introduce general procedures for
inference and prediction, and show that it outperforms its main competitor,
multivariate GARCH, even on financial data that especially suits GARCH.

** Comment:** Supplementary
Material, Best Student Paper Award

Andrew Gordon Wilson, David A Knowles, and Zoubin Ghahramani.
**Gaussian process
regression networks**.
Technical Report arXiv:1110.4411 [stat.ML], Department of Engineering,
University of Cambridge, Cambridge, UK, October 19 2011.

**
Abstract:** We introduce a new regression framework, Gaussian process
regression networks (GPRN), which combines the structural properties of
Bayesian neural networks with the non-parametric flexibility of Gaussian
processes. This model accommodates input dependent signal and noise
correlations between multiple response variables, input dependent
length-scales and amplitudes, and heavy-tailed predictive distributions. We
derive both efficient Markov chain Monte Carlo and variational Bayes
inference procedures for this model. We apply GPRN as a multiple output
regression and multivariate volatility model, demonstrating substantially
improved performance over eight popular multiple output (multi-task) Gaussian
process models and three multivariate volatility models on benchmark
datasets, including a 1000 dimensional gene expression dataset.

** Comment:** arXiv:1110.4411

M. Zhao, A. P. Batista, J. P. Cunningham, C. A. Chestek, Z. Rivera-Alvidrez,
R. Kalmar, S. I. Ryu, K. V. Shenoy, and S. Iyengar.
**An
L1-regularized logistic model for detecting short-term neuronal
interactions.**.
*Journal of Computational Neuroscience*, 2011, doi
10.1007/s10827-011-0365-5.
In Press.

** Abstract:** Interactions among neurons are a key
com- ponent of neural signal processing. Rich neural data sets potentially
containing evidence of interactions can now be collected readily in the
laboratory, but existing analysis methods are often not sufficiently
sensitive and specific to reveal these interactions. Generalized linear
models offer a platform for analyzing multi-electrode recordings of neuronal
spike train data. Here we suggest an L1-regularized logistic regression model
(L1L method) to detect short-term (order of 3ms) neuronal interactions. We
estimate the parameters in this model using a coordinate descent algorithm,
and determine the optimal tuning parameter using a Bayesian Information
Criterion. Simulation studies show that in general the L1L method has better
sensitivities and specificities than those of the traditional
shuffle-corrected cross-correlogram (covariogram) method. The L1L method is
able to detect excitatory interactions with both high sensitivity and
specificity with reasonably large recordings, even when the magnitude of the
interactions is small; similar results hold for inhibition given sufficiently
high baseline firing rates. Our study also suggests that the false positives
can be further removed by thresholding, because their magnitudes are
typically smaller than true interactions. Simulations also show that the L1L
method is somewhat robust to partially observed networks. We apply the method
to multi-electrode recordings collected in the monkey dorsal premotor cortex
(PMd) while the animal prepares to make reaching arm movements. The results
show that some neurons interact differently depending on task conditions. The
stronger interactions detected with our L1L method were also visible using
the covariogram method.

R. P. Adams, Zoubin Ghahramani, and Michael I. Jordan.
**Tree-structured
stick breaking for hierarchical data**.
In *Advances in Neural Information Processing Systems 23*. The MIT
Press, 2010.

** Abstract:** Many data are naturally modeled by
an unobserved hierarchical structure. In this paper we propose a flexible
nonparametric prior over unknown data hierarchies. The approach uses nested
stick-breaking processes to allow for trees of unbounded width and depth,
where data can live at any node and are infinitely exchangeable. One can view
our model as providing infinite mixtures where the components have a
dependency structure corresponding to an evolutionary diffusion down a tree.
By using a stick-breaking approach, we can apply Markov chain Monte Carlo
methods based on slice sampling to perform Bayesian inference and simulate
from the posterior distribution on trees. We apply our method to hierarchical
clustering of images and topic modeling of text data.

R. P. Adams, H. Wallach, and Zoubin Ghahramani.
**Learning the
structure of deep sparse graphical models**.
In Yee Whye Teh and Mike Titterington, editors, *13th International
Conference on Artificial Intelligence and Statistics*, pages 1-8, Chia
Laguna, Sardinia, Italy, May 2010.

** Abstract:** Deep belief
networks are a powerful way to model complex probability distributions.
However, it is difficult to learn the structure of a belief network,
particularly one with hidden units. The Indian buffet process has been used
as a nonparametric Bayesian prior on the structure of a directed belief
network with a single infinitely wide hidden layer. Here, we introduce the
cascading Indian buffet process (CIBP), which provides a prior on the
structure of a layered, directed belief network that is unbounded in both
depth and width, yet allows tractable inference. We use the CIBP prior with
the nonlinear Gaussian belief network framework to allow each unit to vary
its behavior between discrete and continuous representations. We use Markov
chain Monte Carlo for inference in this model and explore the structures
learned on image data.

** Comment:** Winner of the Best Paper Award

Daniel A. Braun and Pedro A. Ortega.
**A minimum relative
entropy principle for adaptive control in linear quadratic
regulators**.
In *Proceedings of the 7th international conference on informatics in
control, automation and robotics*, page (in press), 2010.

**
Abstract:** The design of optimal adaptive controllers is usually based on
heuristics, because solving Bellman's equations over information states is
notoriously intractable. Approximate adaptive controllers often rely on the
principle of certainty-equivalence where the control process deals with
parameter point estimates as if they represented ``true'' parameter values.
Here we present a stochastic control rule instead where controls are sampled
from a posterior distribution over a set of probabilistic input-output models
and the true model is identified by Bayesian inference. This allows
reformulating the adaptive control problem as an inference and sampling
problem derived from a minimum relative entropy principle. Importantly,
inference and action sampling both work forward in time and hence such a
Bayesian adaptive controller is applicable on-line. We demonstrate the
improved performance that can be achieved by such an approach for linear
quadratic regulator examples.

Sébastien Bratières, Jurgen Van Gael, Andreas Vlachos, and Zoubin
Ghahramani.
**Scaling the
iHMM: Parallelization versus Hadoop**.
In *Proceedings of the 2010 10th IEEE International Conference on Computer
and Information Technology*, pages 1235-1240, Bradford, UK, 2010. IEEE
Computer Society, doi
10.1109/CIT.2010.223.

** Abstract:** This paper compares
parallel and distributed implementations of an iterative, Gibbs sampling,
machine learning algorithm. Distributed implementations run under Hadoop on
facility computing clouds. The probabilistic model under study is the
infinite HMM Beal, Ghahramani and Rasmussen,
2002, in which parameters are learnt using an instance blocked Gibbs
sampling, with a step consisting of a dynamic program. We apply this model to
learn part-of-speech tags from newswire text in an unsupervised fashion.
However our focus here is on runtime performance, as opposed to NLP-relevant
scores, embodied by iteration duration, ease of development, deployment and
debugging.

M. M. Churchland, J. P. Cunningham, M. T. Kaufman, S. I. Ryu, and K. V. Shenoy.
**Cortical preparatory activity: Representation of movement or first cog in a
dynamical machine?**.
*Neuron*, 68:387-400, 2010.

** Abstract:** The motor
cortices are active during both movement and movement preparation. A common
assumption is that preparatory activity constitutes a subthreshold form of
movement activity: a neuron active during rightward movements becomes
modestly active during preparation of a rightward movement. We asked whether
this pattern of activity is, in fact, observed. We found that it was not: at
the level of a single neuron, preparatory tuning was weakly correlated with
movement-period tuning. Yet, somewhat paradoxically, preparatory tuning could
be captured by a preferred direction in an abstract "space" that described
the population-level pattern of movement activity. In fact, this relationship
accounted for preparatory responses better than did traditional tuning
models. These results are expected if preparatory activity provides the
initial state of a dynamical system whose evolution produces movement
activity. Our results thus suggest that preparatory activity may not
represent specific factors, and may instead play a more mechanistic role.

M. M. Churchland, B. M. Yu, J. P. Cunningham, L. P. Sugrue, M. R. Cohen, G. S.
Corrado, W. T. Newsome, A. M. Clark, P. Hosseini, B. B. Scott, D. C. Bradley,
M. A. Smith, A. Kohn, J. A. Movshon, K. M. Armstrong, T. Moore, S. W. Chang,
L. H. Snyder, S. G. Lisberger, N. J. Priebe, I. M. Finn, D. Ferster, S. I.
Ryu, G. Santhanam, M. Sahani, and K. V. Shenoy.
**Stimulus onset quashes neural variability: a widespread cortical phenomenon**.
*Nature Neuroscience*, 13:369-378, 2010.

** Abstract:**
Neural responses are typically characterized by computing the mean firing
rate, but response variability can exist across trials. Many studies have
examined the effect of a stimulus on the mean response, but few have examined
the effect on response variability. We measured neural variability in 13
extracellularly recorded datasets and one intracellularly recorded dataset
from seven areas spanning the four cortical lobes in monkeys and cats. In
every case, stimulus onset caused a decline in neural variability. This
occurred even when the stimulus produced little change in mean firing rate.
The variability decline was observed in membrane potential recordings, in the
spiking of individual neurons and in correlated spiking variability measured
with implanted 96-electrode arrays. The variability decline was observed for
all stimuli tested, regardless of whether the animal was awake, behaving or
anaesthetized. This widespread variability decline suggests a rather general
property of cortex, that its state is stabilized by an input.

Marc Peter Deisenroth.
**Efficient reinforcement
learning using Gaussian processes**.
PhD thesis, Karlsruhe Institute of Technology, Karlsruhe, Germany, 2010.

** Abstract:** In many research areas, including control and
medical applications, we face decision-making problems where data are limited
and/or the underlying generative process is complicated and partially
unknown. In these scenarios, we can profit from algorithms that learn from
data and aid decision making.

Reinforcement learning (RL) is a general
computational approach to experience-based goal-directed learning for
sequential decision making under uncertainty. However, RL often lacks
efficiency in terms of the number of required trials when no task-specific
knowledge is available. This lack of efficiency makes RL often inapplicable
to (optimal) control problems. Thus, a central issue in RL is to speed up
learning by extracting more information from available experience.

The
contributions of this dissertation are threefold:

1. We propose PILCO, a
fully Bayesian approach for efficient RL in continuous-valued state and
action spaces when no expert knowledge is available. PILCO is based on
well-established ideas from statistics and machine learning. PILCO's key
ingredient is a probabilistic dynamics model learned from data, which is
implemented by a Gaussian process (GP). The GP carefully quantifies knowledge
by a probability distribution over plausible dynamics models. By averaging
over all these models during long-term planning and decision making, PILCO
takes uncertainties into account in a principled way and, therefore, reduces
model bias, a central problem in model-based RL.

2. Due to its generality
and efficiency, PILCO can be considered a conceptual and practical approach
to jointly learning models and controllers when expert knowledge is difficult
to obtain or simply not available. For this scenario, we investigate PILCO's
properties its applicability to challenging real and simulated nonlinear
control problems. For example, we consider the tasks of learning to swing up
a double pendulum attached to a cart or to balance a unicycle with five
degrees of freedom. Across all tasks we report unprecedented automation and
an unprecedented learning efficiency for solving these tasks.

3. As a
step toward pilco's extension to partially observable Markov decision
processes, we propose a principled algorithm for robust filtering and
smoothing in GP dynamic systems. Unlike commonly used Gaussian filters for
nonlinear systems, it does neither rely on function linearization nor on
finite-sample representations of densities. Our algorithm profits from exact
moment matching for predictions while keeping all computations analytically
tractable. We present experimental evidence that demonstrates the robustness
and the advantages of our method over unscented Kalman filters, the cubature
Kalman filter, and the extended Kalman filter.

Daniel Glass, Leopold Parts, David A. Knowles, Abraham Aviv, , and Tim D.
Spector.
**No correlation between childhood maltreatment and telomere length.**.
*Biological Psychiatry*, 68(6):21-22, 2010.

**
Abstract:** Telomeres are lengths of repetitive DNA that cap the ends of
chromosomes. They protect the ends of the chromosome and shorten with each
cell division. Short leukocyte telomere length has been related to a number
of age-related diseases. In addition, shorter telomere length has been
associated with environmental factors such as smoking and lack of exercise.
In a recent issue of Biological Psychiatry, Tyrka et al. (4) published a
report suggesting a link between maltreatment in childhood and telomere
shortening in 31 subjects. Individuals who had suffered maltreatment had
telomere length .70 +/- .24 compared with 1.02 +/- .52 in individuals who had
not been abused.

Dilan Görür and Carl Edward Rasmussen.
**Dirichlet process
Gaussian mixture models: Choice of the base distribution**.
*Journal of Computer Science and Technology*, 25(4):615-625, July 2010,
doi
10.1007/s11390-010-9355-8.

** Abstract:** In the Bayesian
mixture modeling framework it is possible to infer the necessary number of
components to model the data and therefore it is unnecessary to explicitly
restrict the number of components. Nonparametric mixture models sidestep the
problem of finding the "correct" number of mixture components by assuming
infinitely many components. In this paper Dirichlet process mixture (DPM)
models are cast as infinite mixture models and inference using Markov chain
Monte Carlo is described. The specification of the priors on the model
parameters is often guided by mathematical and practical convenience. The
primary goal of this paper is to compare the choice of conjugate and
non-conjugate base distributions on a particular class of DPM models which is
widely used in applications, the Dirichlet process Gaussian mixture model
(DPGMM). We compare computational efficiency and modeling performance of
DPGMM defined using a conjugate and a conditionally conjugate base
distribution. We show that better density models can result from using a
wider class of priors with no or only a modest increase in computational
effort.

Y. Guan, J. G. Dy, D. Niu, and Z. Ghahramani.
**Variational
inference for nonparametric multiple clustering**.
In *KDD10 Workshop on Discovering, Summarizing, and Using Multiple
Clusterings*, Washington, DC, USA, July 2010.

**
Abstract:** Most clustering algorithms produce a single clustering
solution. Similarly, feature selection for clustering tries to find one
feature subset where one interesting clustering solution resides. However, a
single data set may be multi-faceted and can be grouped and interpreted in
many different ways, especially for high dimensional data, where feature
selection is typically needed. Moreover, different clustering solutions are
interesting for different purposes. Instead of committing to one clustering
solution, in this paper we introduce a probabilistic nonparametric Bayesian
model that can discover several possible clustering solutions and the feature
subset views that generated each cluster partitioning simultaneously. We
provide a variational inference approach to learn the features and clustering
partitions in each view. Our model allows us not only to learn the multiple
clusterings and views but also allows us to automatically learn the number of
views and the number of clusters in each view.

Ferenc Huszár, Uta Noppeney, and Máté Lengyel.
**Mind reading by
machine learning: A doubly Bayesian method for inferring mental
representations**.
In S. Ohlsson and R. Catrambone, editors, *The Proceedings of the 32nd
Annual Meeting of the Cognitive Science Society*, Austin, TX, USA, August
2010. The Cognitive Science Society.

** Abstract:** A central
challenge in cognitive science is to measure and quantify the mental
representations humans develop - in other words, to 'read' subject's minds.
In order to eliminate potential biases in reporting mental contents due to
verbal elaboration, subjects' responses in experiments are often limited to
binary decisions or discrete choices that do not require conscious reflection
upon their mental contents. However, it is unclear what such impoverished
data can tell us about the potential richness and dynamics of subjects'
mental representations. To address this problem, we used ideal observer
models that formalise choice behaviour as (quasi-)Bayes-optimal, given
subjects' representations in long-term memory, acquired through prior
learning, and the stimuli currently available to them. Bayesian inversion of
such ideal observer models allowed us to infer subjects' mental
representation from their choice behaviour in a variety of psychophysical
tasks. The inferred mental representations also allowed us to predict future
choices of subjects with reasonable accuracy, even in tasks that were
different from those in which the representations were estimated. These
results demonstrate a significant potential in standard binary decision tasks
to recover detailed information about subjects' mental representations

** Comment:** Supplementary material available here.

G. Kasneci, J. Van Gael, T. Graepel, and R. Herbrich.
**Bayesian
knowledge corroboration with logical rules and user feedback**.
In *European Conference on Machine Learning (ECML)*, Barcelona, Spain,
September 2010.

** Abstract:** Current knowledge bases suffer
from either low coverage or low accuracy. The underlying hypothesis of this
work is that user feedback can greatly improve the quality of automatically
extracted knowledge bases. The feedback could help quantify the uncertainty
associated with the stored statements and would enable mechanisms for
searching, ranking and reasoning at entity-relationship level. Most
importantly, a principled model for exploiting user feedback to learn the
truth values of statements in the knowledge base would be a major step
forward in addressing the issue of knowledge base curation. We present a
family of probabilistic graphical models that builds on user feedback and
logical inference rules derived from the popular Semantic-Web formalism of
RDFS [1]. Through internal
inference and belief propagation, these models can learn both, the truth
values of the statements in the knowledge base and the reliabilities of the
users who give feedback. We demonstrate the viability of our approach in
extensive experiments on real-world datasets, with feedback collected from
Amazon Mechanical Turk.

David A. Knowles, Leopold Parts, Daniel Glass, and John M. Winn.
**Modeling skin
and ageing phenotypes using latent variable models in infer.net**.
In *NIPS Workshop: Predictive Models in Personalized Medicine Workshop*,
2010.

** Abstract:** We demonstrate and compare three
unsupervised Bayesian latent variable models implemented in Infer.NET for
biomedical data modeling of 42 skin and ageing phenotypes measured on the
12,000 female twins in the Twins UK study. We address various data modeling
problems include high missingness, heterogeneous data, and repeat
observations. We compare the proposed models in terms of their performance at
predicting disease labels and symptoms from available explanatory variables,
concluding that factor analysis type models have the strongest statistical
performance in this setting. We show that such models can be combined with
regression components for improved interpretability.

** Comment:** web
site

Miguel Lázaro-Gredilla, Joaquin Quiñonero-Candela, Carl Edward Rasmussen,
and Aníbal Figueiras-Vidal.
**Sparse spectrum Gaussian process regression**.
*Journal of Machine Learning Research*, 11:1865-1881, June 2010.

** Abstract:** We present a new sparse Gaussian Process (GP)
model for regression. The key novel idea is to sparsify the *spectral
representation* of the GP. This leads to a simple, practical algorithm for
regression tasks. We compare the achievable trade-offs between predictive
accuracy and computational requirements, and show that these are typically
superior to existing state-of-the-art sparse approximations. We discuss both
the weight space and function space representations, and note that the new
construction implies priors over functions which are always stationary, and
can approximate any covariance function in this class.

J. Leskovec, D. Chakrabarti, J. Kleinberg, C. Faloutsos, and Z. Ghahramani.
**Kronecker
graphs: An approach to modeling networks**.
*Journal of Machine Learning Research*, 11(Feb):985-1042, 2010.

** Abstract:** How can we generate realistic networks? In
addition, how can we do so with a mathematically tractable model that allows
for rigorous analysis of network properties? Real networks exhibit a long
list of surprising properties: Heavy tails for the in- and out-degree
distribution, heavy tails for the eigenvalues and eigenvectors, small
diameters, and densification and shrinking diameters over time. Current
network models and generators either fail to match several of the above
properties, are complicated to analyze mathematically, or both. Here we
propose a generative model for networks that is both mathematically tractable
and can generate networks that have all the above mentioned structural
properties. Our main idea here is to use a non-standard matrix operation, the
Kronecker product, to generate graphs which we refer to as "Kronecker
graphs".

First, we show that Kronecker graphs naturally obey common
network properties. In fact, we rigorously prove that they do so. We also
provide empirical evidence showing that Kronecker graphs can effectively
model the structure of real networks.

We then present KRONFIT, a fast and
scalable algorithm for fitting the Kronecker graph generation model to large
real networks. A naive approach to fitting would take super-exponential time.
In contrast, KRONFIT takes linear time, by exploiting the structure of
Kronecker matrix multiplication and by using statistical simulation
techniques. Experiments on a wide range of large real and synthetic networks
show that KRONFIT finds accurate parameters that very well mimic the
properties of target networks. In fact, using just four parameters we can
accurately model several aspects of global network structure. Once fitted,
the model parameters can be used to gain insights about the network
structure, and the resulting synthetic graphs can be used for null-models,
anonymization, extrapolations, and graph summarization.

C. Lippert, Z. Ghahramani, and K. Borgwardt.
**Gene function
prediction from synthetic lethality networks via ranking on demand**.
*Bioinformatics*, 26:912-918, 2010.

** Abstract:**
Motivation: Synthetic lethal interactions represent pairs of genes whose
individual mutations are not lethal, while the double mutation of both genes
does incur lethality. Several studies have shown a correlation between
functional similarity of genes and their distances in networks based on
synthetic lethal interactions. However, there is a lack of algorithms for
predicting gene function from synthetic lethality interaction networks.

Results: In this article, we present a novel technique called kernelROD for
gene function prediction from synthetic lethal interaction networks based on
kernel machines. We apply our novel algorithm to Gene Ontology functional
annotation prediction in yeast. Our experiments show that our method leads to
improved gene function prediction compared with state-of-the-art competitors
and that combining genetic and congruence networks leads to a further
improvement in prediction accuracy.

Hannes Nickisch and Carl Edward Rasmussen.
**Gaussian mixture
modeling with Gaussian process latent variable models**.
In *Proceedings of the 32nd DAGM Symposium on Pattern Recognition*,
Lecture Notes in Computer Science (LNCS), Darmstadt, Germany, September 2010.
Springer, doi
10.1007/978-3-642-15986-2_28.

** Abstract:** Density
modeling is notoriously difficult for high dimensional data. One approach to
the problem is to search for a lower dimensional manifold which captures the
main characteristics of the data. Recently, the Gaussian Process Latent
Variable Model (GPLVM) has successfully been used to find low dimensional
manifolds in a variety of complex data. The GPLVM consists of a set of points
in a low dimensional latent space, and a stochastic map to the observed
space. We show how it can be interpreted as a density model in the observed
space. However, the GPLVM is not trained as a density model and therefore
yields bad density estimates. We propose a new training strategy and obtain
improved generalisation performance and better density estimates in
comparative evaluations on several benchmark data sets.

Pedro A. Ortega and Daniel A. Braun.
**A
conversion between utility and information**.
In *The third conference on artificial general intelligence*, pages
115-120, Paris, 2010. Atlantis Press.

** Abstract:** Rewards
typically express desirabilities or preferences over a set of alternatives.
Here we propose that rewards can be defined for any probability distribution
based on three desiderata, namely that rewards should be real- valued,
additive and order-preserving, where the later implies that more probable
events should also be more desirable. Our main result states that rewards are
then uniquely determined by the negative information content. To analyze
stochastic processes, we define the utility of a realization as its reward
rate. Under this interpretation, we show that the expected utility of a
stochastic process is its negative entropy rate. Furthermore, we apply our
results to analyze agent-environment interactions. We show that the expected
utility that will actually be achieved by the agent is given by the negative
cross-entropy from the input-output (I/O) distribution of the coupled
interaction system and the agent's I/O distribution. Thus, our results allow
for an information-theoretic interpretation of the notion of utility and the
characterization of agent-environment interactions in terms of entropy
dynamics.

Pedro A. Ortega and Daniel A. Braun.
**A
Bayesian rule for adaptive control based on causal interventions**.
In *The third conference on artificial general intelligence*, pages
115-120, Paris, 2010. Atlantis Press.

** Abstract:**
Explaining adaptive behavior is a central problem in artificial intelligence
research. Here we formalize adaptive agents as mixture distributions over
sequences of inputs and outputs (I/O). Each distribution of the mixture
constitutes a "possible world", but the agent does not know which of the
possible worlds it is actually facing. The problem is to adapt the I/O stream
in a way that is compatible with the true world. A natural measure of
adaptation can be obtained by the Kullback Leibler (KL) divergence between
the I/O distribution of the true world and the I/O distribution expected by
the agent that is uncertain about possible worlds. In the case of pure input
streams, the Bayesian mixture provides a well-known solution for this
problem. We show, however, that in the case of I/O streams this solution
breaks down, because outputs are issued by the agent itself and require a
different probabilistic syntax as provided by intervention calculus. Based on
this calculus, we obtain a Bayesian control rule that allows modeling
adaptive behavior with mixture distributions over I/O streams. This rule
might allow for a novel approach to adaptive control based on a minimum
KL-principle.

Pedro A. Ortega and Daniel A. Braun.
**A minimum relative entropy
principle for learning and acting**.
*Journal of Artificial Intelligence Research*, 38:475-511, 2010, doi 10.1613/jair.3062.

** Abstract:** This paper proposes a method to construct an
adaptive agent that is universal with respect to a given class of experts,
where each expert is designed specifically for a particular environment. This
adaptive control problem is formalized as the problem of minimizing the
relative entropy of the adaptive agent from the expert that is most suitable
for the unknown environment. If the agent is a passive observer, then the
optimal solution is the well-known Bayesian predictor. However, if the agent
is active, then its past actions need to be treated as causal interventions
on the I/O stream rather than normal probability conditions. Here it is shown
that the solution to this new variational problem is given by a stochastic
controller called the Bayesian control rule, which implements adaptive
behavior as a mixture of experts. Furthermore, it is shown that under mild
assumptions, the Bayesian control rule converges to the control law of the
most suitable expert.

Pedro A. Ortega and Daniel A. Braun.
**An axiomatic formalization of
bounded rationality based on a utility-information equivalence**.
Technical report, Dept. of Engineering, University of Cambridge, 2010.

** Abstract:** Classic decision-theory is based on the maximum
expected utility (MEU) principle, but crucially ignores the resource costs
incurred when determining optimal decisions. Here we propose an axiomatic
framework for bounded decision-making that considers resource costs. Agents
are formalized as probability measures over input-output streams. We
postulate that any such probability measure can be assigned a corresponding
conjugate utility function based on three axioms: utilities should be
real-valued, additive and monotonic mappings of probabilities. We show that
these axioms enforce a unique conversion law between utility and probability
(and thereby, information). Moreover, we show that this relation can be
characterized as a variational principle: given a utility function, its
conjugate probability measure maximizes a free utility functional.
Transformations of probability measures can then be formalized as a change in
free utility due to the addition of new constraints expressed by a target
utility function. Accordingly, one obtains a criterion to choose a
probability measure that trades off the maximization of a target utility
function and the cost of the deviation from a reference distribution. We show
that optimal control, adaptive estimation and adaptive control problems can
be solved this way in a resource-efficient way. When resource costs are
ignored, the MEU principle is recovered. Our formalization might thus provide
a principled approach to bounded rationality that establishes a close link to
information theory.

Carl Edward Rasmussen and Hannes Nickisch.
**Gaussian Processes for Machine Learning (GPML) Toolbox**.
*Journal of Machine Learning Research*, 11:3011-3015, December 2010.

** Abstract:** The GPML toolbox provides a wide range of
functionality for Gaussian process (GP) inference and prediction. GPs are
specified by mean and covariance functions; we offer a library of simple mean
and covariance functions and mechanisms to compose more complex ones. Several
likelihood functions are supported including Gaussian and heavy-tailed for
regression as well as others suitable for classification. Finally, a range of
inference methods is provided, including exact and variational inference,
Expectation Propagation, and Laplace's method dealing with non-Gaussian
likelihoods and FITC for dealing with large regression tasks.

** Comment:** Toolbox avaiable from here. Implements algorithms
from Rasmussen and Williams, 2006.

C. Rotsos, J. Van Gael, A.W. Moore, and Z. Ghahramani.
**Traffic
classification in information poor environments**.
In *1st International Workshop on Traffic Analysis and Classification (IWCMC
'10)*, Caen, France, July 2010.

** Abstract:** Traffic
classification using machine learning continues to be an active research
area. The majority of work in this area uses *off-the-shelf* machine
learning tools and treats them as *black-box* classifiers. This approach
turns all the modelling complexity into a feature selection problem. In this
paper, we build a problem-specific solution to the traffic classification
problem by designing a custom probabilistic graphical model. Graphical models
are a modular framework to design classifiers which incorporate
domain-specific knowledge. More specifically, our solution introduces
semi-supervised learning which means we learn from both labelled and
unlabelled traffic flows. We show that our solution performs competitively
compared to previous approaches while using less data and simpler
features.

Charalampos Rotsos, Jurgen Van Gael, Andrew W. Moore, and Zoubin Ghahramani.
**Probabilistic
graphical models for semi-supervised traffic classification**.
In *The 6th International Wireless Communications and Mobile Computing
Conference*, pages 752-757, Caen, France, 2010.

**
Abstract:** Traffic classification using machine learning continues to be
an active research area. The majority of work in this area uses off-the-shelf
machine learning tools and treats them as black-box classifiers. This
approach turns all the modelling complexity into a feature selection problem.
In this paper, we build a problem-specific solution to the traffic
classification problem by designing a custom probabilistic graphical model.
Graphical models are a modular framework to design classifiers which
incorporate domain-specific knowledge. More specifically, our solution
introduces semi-supervised learning which means we learn from both labelled
and unlabelled traffic flows. We show that our solution performs
competitively compared to previous approaches while using less data and
simpler features.

Yunus Saatçi, Ryan Turner, and Carl Edward Rasmussen.
**Gaussian process
change point models**.
In *27th International Conference on Machine Learning*, pages 927-934,
Haifa, Israel, June 2010.

** Abstract:** We combine Bayesian
online change point detection with Gaussian processes to create a
nonparametric time series model which can handle change points. The model can
be used to locate change points in an online manner; and, unlike other
Bayesian online change point detection algorithms, is applicable when
temporal correlations in a regime are expected. We show three variations on
how to apply Gaussian processes in the change point context, each with their
own advantages. We present methods to reduce the computational burden of
these models and demonstrate it on several real world data sets.

R. S. Savage, Z. Ghahramani, J. E. Griffin, B. de la Cruz, and D. L. Wild.
**Discovering
transcriptional modules by Bayesian data integration**.
*Bioinformatics*, 26:i158-i167, 2010.

** Abstract:**
Motivation: We present a method for directly inferring transcriptional
modules (TMs) by integrating gene expression and transcription factor binding
(ChIP-chip) data. Our model extends a hierarchical Dirichlet process mixture
model to allow data fusion on a gene-by-gene basis. This encodes the
intuition that co-expression and co-regulation are not necessarily equivalent
and hence we do not expect all genes to group similarly in both datasets. In
particular, it allows us to identify the subset of genes that share the same
structure of transcriptional modules in both datasets.

Results: We find
that by working on a gene-by-gene basis, our model is able to extract
clusters with greater functional coherence than existing methods. By
combining gene expression and transcription factor binding (ChIP-chip) data
in this way, we are better able to determine the groups of genes that are
most likely to represent underlying TMs.

Availability: If interested in
the code for the work presented in this article, please contact the
authors.

R. Silva, K. A. Heller, Z. Ghahramani, and E. M. Airoldi.
**Ranking
relations using analogies in biological and information networks**.
*Annals of Applied Statistics*, 4(2):615-644, 2010.

**
Abstract:** Analogical reasoning depends fundamentally on the ability to
learn and generalize about relations between objects. We develop an approach
to relational learning which, given a set of pairs of objects S = A(1):B(1),
A(2):B(2), ..., A(N):B(N), measures how well other pairs A:B fit in with the
set S. Our work addresses the question: is the relation between objects A and
B analogous to those relations found in S? Such questions are particularly
relevant in information retrieval, where an investigator might want to search
for analogous pairs of objects that match the query set of interest. There
are many ways in which objects can be related, making the task of measuring
analogies very challenging. Our approach combines a similarity measure on
function spaces with Bayesian analysis to produce a ranking. It requires data
containing features of the objects of interest and a link matrix specifying
which relationships exist; no further attributes of such relationships are
necessary. We illustrate the potential of our method on text analysis and
information networks. An application on discovering functional interactions
between pairs of proteins is discussed in detail, where we show that our
approach can work in practice even if a small set of protein pairs is
provided.

O. Stegle, K. J. Denby, E. J. Cooke, D. L. Wild, Z. Ghahramani, and K. M.
Borgwardt.
**A
robust Bayesian two-sample test for detecting intervals of differential
gene expression in microarray time series**.
*Journal of Computational Biology*, 17(3):1-13, 2010, doi
10.1089/cmb.2009.0175.

** Abstract:** Understanding the
regulatory mechanisms that are responsible for an organism's response to
environmental change is an important issue in molecular biology. A first and
important step towards this goal is to detect genes whose expression levels
are affected by altered external conditions. A range of methods to test for
differential gene expression, both in static as well as in time-course
experiments, have been proposed. While these tests answer the question
*whether* a gene is differentially expressed, they do not explicitly
address the question *when* a gene is differentially expressed, although
this information may provide insights into the course and causal structure of
regulatory programs. In this article, we propose a twosample test for
identifying intervals of differential gene expression in microarray time
series. Our approach is based on Gaussian process regression, can deal with
arbitrary numbers of replicates, and is robust with respect to outliers. We
apply our algorithm to study the response of *Arabidopsis thaliana*
genes to an infection by a fungal pathogen using a microarray time series
dataset covering 30,336 gene probes at 24 observed time points. In
classification experiments, our test compares favorably with existing methods
and provides additional insights into time-dependent differential
expression.

Richard E. Turner.
**Statistical Models for Natural Sounds**.
PhD thesis, Gatsby Computational Neuroscience Unit, UCL, 2010.

**
Abstract:** It is important to understand the rich structure of natural
sounds in order to solve important tasks, like automatic speech recognition,
and to understand auditory processing in the brain. This thesis takes a step
in this direction by characterising the statistics of simple natural sounds.
We focus on the statistics because perception often appears to depend on
them, rather than on the raw waveform. For example the perception of auditory
textures, like running water, wind, fire and rain, depends on
summary-statistics, like the rate of falling rain droplets, rather than on
the exact details of the physical source. In order to analyse the statistics
of sounds accurately it is necessary to improve a number of traditional
signal processing methods, including those for amplitude demodulation,
time-frequency analysis, and sub-band demodulation. These estimation tasks
are ill-posed and therefore it is natural to treat them as Bayesian inference
problems. The new probabilistic versions of these methods have several
advantages. For example, they perform more accurately on natural signals and
are more robust to noise, they can also fill-in missing sections of data, and
provide error-bars. Furthermore, free-parameters can be learned from the
signal. Using these new algorithms we demonstrate that the energy, sparsity,
modulation depth and modulation time-scale in each sub-band of a signal are
critical statistics, together with the dependencies between the sub-band
modulators. In order to validate this claim, a model containing co-modulated
coloured noise carriers is shown to be capable of generating a range of
realistic sounding auditory textures. Finally, we explored the connection
between the statistics of natural sounds and perception. We demonstrate that
inference in the model for auditory textures qualitatively replicates the
primitive grouping rules that listeners use to understand simple acoustic
scenes. This suggests that the auditory system is optimised for the
statistics of natural sounds.

Ryan Turner, Steven Bottone, and Zoubin Ghahramani.
**Fast online
anomaly detection using scan statistics**.
In Samuel Kaski, David J. Miller, Erkki Oja, and Antti Honkela, editors,
*Machine Learning for Signal Processing (MLSP 2010)*, pages 385-390,
Kittilä, Finland, August 2010.

** Abstract:** We present
methods to do fast online anomaly detection using scan statistics. Scan
statistics have long been used to detect statistically significant bursts of
events. We extend the scan statistics framework to handle many practical
issues that occur in application: dealing with an unknown background rate of
events, allowing for slow natural changes in background frequency, the
inverse problem of finding an unusual lack of events, and setting the test
parameters to maximize power. We demonstrate its use on real and synthetic
data sets with comparison to other methods.

Ryan Turner, Marc Peter Deisenroth, and Carl Edward Rasmussen.
**State-space
inference and learning with Gaussian processes**.
In Yee Whye Teh and Mike Titterington, editors, *13th International
Conference on Artificial Intelligence and Statistics*, volume 9 of
*W & CP*, pages 868-875, Chia Laguna, Sardinia, Italy, May 13-15
2010. Journal of Machine Learning Research.

** Abstract:**
State-space inference and learning with Gaussian processes (GPs) is an
unsolved problem. We propose a new, general methodology for inference and
learning in nonlinear state-space models that are described probabilistically
by non-parametric GP models. We apply the expectation maximization algorithm
to iterate between inference in the latent state-space and learning the
parameters of the underlying GP dynamics model.

** Comment:** poster.

Ryan Turner and Carl Edward Rasmussen.
**Model based learning
of sigma points in unscented Kalman filtering**.
In Samuel Kaski, David J. Miller, Erkki Oja, and Antti Honkela, editors,
*Machine Learning for Signal Processing (MLSP 2010)*, pages 178-183,
Kittilä, Finland, August 2010.

** Abstract:** The
unscented Kalman filter (UKF) is a widely used method in control and time
series applications. The UKF suffers from arbitrary parameters necessary for
a step known as sigma point placement, causing it to perform poorly in
nonlinear problems. We show how to treat sigma point placement in a UKF as a
learning problem in a model based view. We demonstrate that learning to place
the sigma points correctly from data can make sigma point collapse much less
likely. Learning can result in a significant increase in predictive
performance over default settings of the parameters in the UKF and other
filters designed to avoid the problems of the UKF, such as the GP-ADF. At the
same time, we maintain a lower computational complexity than the other
methods. We call our method UKF-L.

Richard E. Turner and Maneesh Sahani.
**Statistical inference for single- and multi-band probabilistic
amplitude demodulation.**.
In *Proceedings of the IEEE International Conference on Acoustics, Speech,
and Signal Processing (ICASSP)*, pages 5466-5469, 2010.

**
Abstract:** Amplitude demodulation is an ill-posed problem and so it is
natural to treat it from a Bayesian viewpoint, inferring the most likely
carrier and envelope under probabilistic constraints. One such treatment is
Probabilistic Amplitude Demodulation (PAD), which, whilst computationally
more intensive than traditional approaches, offers several advantages. Here
we provide methods for estimating the uncertainty in the PAD-derived
envelopes and carriers, and for learning free-parameters like the time-scale
of the envelope. We show how the probabilistic approach can naturally handle
noisy and missing data. Finally, we indicate how to extend the model to
signals which contain multiple modulators and carriers.

Andreas Vlachos, Zoubin Ghahramani, and Ted Briscoe.
**Active learning
for constrained Dirichlet process mixture models**.
In *Proceedings of the 2010 Workshop on Geometrical Models of Natural
Language Semantics*, pages 57-61, Uppsala, Sweden, 2010.

**
Abstract:** Recent work applied Dirichlet Process Mixture Models to the
task of verb clustering, incorporating supervision in the form of must-links
and cannot-links constraints between instances. In this work, we introduce an
active learning approach for constraint selection employing uncertainty-based
sampling. We achieve substantial improvements over random selection on two
datasets.

Andrew Gordon Wilson and Zoubin Ghahramani.
**Copula
processes**.
In *Advances in Neural Information Processing Systems 23*, 2010.
Spotlight.

** Abstract:** We define a copula process which
describes the dependencies between arbitrarily many random variables
independently of their marginal distributions. As an example, we develop a
stochastic volatility model, Gaussian Copula Process Volatility (GCPV), to
predict the latent standard deviations of a sequence of random variables. To
make predictions we use Bayesian inference, with the Laplace approximation,
and with Markov chain Monte Carlo as an alternative. We find our model can
outperform GARCH on simulated and financial data. And unlike GARCH, GCPV can
easily handle missing data, incorporate covariates other than time, and model
a rich class of covariance structures.

** Comment:** Supplementary Material, slides.

Sinead Williamson, Peter Orbanz, and Zoubin Ghahramani.
**Dependent
Indian buffet processes**.
In *13th International Conference on Artificial Intelligence and
Statistics*, volume 9 of *W & CP*, pages 924-931, Chia
Laguna, Sardinia, Italy, May 2010.

** Abstract:** Latent
variable models represent hidden structure in observational data. To account
for the distribution of the observational data changing over time, space or
some other covariate, we need generalizations of latent variable models that
explicitly capture this dependency on the covariate. A variety of such
generalizations has been proposed for latent variable models based on the
Dirichlet process. We address dependency on covariates in binary latent
feature models, by introducing a dependent Indian Buffet Process. The model
generates a binary random matrix with an unbounded number of columns for each
value of the covariate. Evolution of the binary matrices over the covariate
set is controlled by a hierarchical Gaussian process model. The choice of
covariance functions controls the dependence structure and exchangeability
properties of the model. We derive a Markov Chain Monte Carlo sampling
algorithm for Bayesian inference, and provide experiments on both synthetic
and real-world data. The experimental results show that explicit modeling of
dependencies significantly improves accuracy of predictions.

Sinead Williamson, Katherine A. Heller, C. Wang, and D. M. Blei.
**The IBP
compound Dirichlet process and its application to focused topic
modeling**.
In *27th International Conference on Machine Learning*, pages
1151-1158, Haifa, Israel, June 2010.

** Abstract:** The
hierarchical Dirichlet process (HDP) is a Bayesian nonparametric mixed
membership model - each data point is modeled with a collection of
components of different proportions. Though powerful, the HDP makes an
assumption that the probability of a component being exhibited by a data
point is positively correlated with its proportion within that data point.
This might be an undesirable assumption. For example, in topic modeling, a
topic (component) might be rare throughout the corpus but dominant within
those documents (data points) where it occurs. We develop the IBP compound
Dirichlet process (ICD), a Bayesian nonparametric prior that decouples
across-data prevalence and within-data proportion in a mixed membership
model. The ICD combines properties from the HDP and the Indian buffet process
(IBP), a Bayesian nonparametric prior on binary matrices. The ICD assigns a
subset of the shared mixture components to each data point. This subset, the
data point's "focus", is determined independently from the amount that each
of its components contribute. We develop an ICD mixture model for text, the
focused topic model (FTM), and show superior performance over the HDP-based
topic model.

R. Adams and Zoubin Ghahramani.
**Archipelago:
nonparametric Bayesian semi-supervised learning**.
In Léon Bottou and Michael Littman, editors, *26th International
Conference on Machine Learning*, pages 1-8, Montréal, QC, Canada,
June 2009. Omnipress.

** Abstract:** Semi-supervised learning
(SSL), is classification where additional unlabeled data can be used to
improve accuracy. Generative approaches are appealing in this situation, as a
model of the data's probability density can assist in identifying clusters.
Nonparametric Bayesian methods, while ideal in theory due to their principled
motivations, have been difficult to apply to SSL in practice. We present a
nonparametric Bayesian method that uses Gaussian processes for the generative
model, avoiding many of the problems associated with Dirichlet process
mixture models. Our model is fully generative and we take advantage of recent
advances in Markov chain Monte Carlo algorithms to provide a practical
inference method. Our method compares favorably to competing approaches on
synthetic and real-world multi-class data.

** Comment:** This paper was awarded Honourable Mention for
Best Paper at ICML 2009.

Morten Arngren, Mikkel N. Schmidt, and Jan Larsen.
**Bayesian
nonnegative matrix factorization with volume prior for unmixing of
hyperspectral images**.
In *Machine Learning for Signal Processing, IEEE Workshop on (MLSP)*,
pages 1-6, Grenoble, France, September 2009. doi
10.1109/MLSP.2009.5306262.

** Abstract:** In hyperspectral
image analysis the objective is to unmix a set of acquired pixels into pure
spectral signatures (endmembers) and corresponding fractional abundances. The
Non-negative Matrix Factorization (NMF) methods have received a lot of
attention for this unmixing process. Many of these NMF based unmixing
algorithms are based on sparsity regularization encouraging pure spectral
endmembers, but this is not optimal for certain applications, such as foods,
where abundances are not sparse. The pixels will theoretically lie on a
simplex and hence the endmembers can be estimated as the vertices of the
smallest enclosing simplex. In this context we present a Bayesian framework
employing a volume constraint for the NMF algorithm, where the posterior
distribution is numerically sampled from using a Gibbs sampling procedure. We
evaluate the method on synthetical and real hyperspectral data of wheat
kernels.

** Comment:** This paper was "rated among the best papers
submitted" to the 2009 Machine Learning for Signal Processing conference.

Pietro Berkes, Richard E. Turner, and Maneesh Sahani.
**A structured model of video reproduces primary visual cortical
organisation**.
*PLoS Computational Biology*, 5(9), 09 2009, doi
10.1371/journal.pcbi.1000495.

** Abstract:** The visual
system must learn to infer the presence of objects and features in the world
from the images it encounters, and as such it must, either implicitly or
explicitly, model the way these elements interact to create the image. Do the
response properties of cells in the mammalian visual system reflect this
constraint? To address this question, we constructed a probabilistic model in
which the identity and attributes of simple visual elements were represented
explicitly and learnt the parameters of this model from unparsed, natural
video sequences. After learning, the behaviour and grouping of variables in
the probabilistic model corresponded closely to functional and anatomical
properties of simple and complex cells in the primary visual cortex (V1). In
particular, feature identity variables were activated in a way that resembled
the activity of complex cells, while feature attribute variables responded
much like simple cells. Furthermore, the grouping of the attributes within
the model closely parallelled the reported anatomical grouping of simple
cells in cat V1. Thus, this generative model makes explicit an interpretation
of complex and simple cells as elements in the segmentation of a visual scene
into basic independent features, along with a parametrisation of their
moment-by-moment appearances. We speculate that such a segmentation may form
the initial stage of a hierarchical system that progressively separates the
identity and appearance of more articulated visual elements, culminating in
view-invariant object recognition.

Daniel A. Braun, Pedro A. Ortega, and Daniel M. Wolpert.
**Nash equilibria in
multi-agent motor interactions**.
*PLoS Computational Biology*, 5(8), 2009.

** Abstract:**
Social interactions in classic cognitive games like the ultimatum game or the
prisoner's dilemma typically lead to Nash equilibria when multiple
competitive decision makers with perfect knowledge select optimal strategies.
However, in evolutionary game theory it has been shown that Nash equilibria
can also arise as attractors in dynamical systems that can describe, for
example, the population dynamics of microorganisms. Similar to such
evolutionary dynamics, we find that Nash equilibria arise naturally in motor
interactions in which players vie for control and try to minimize effort.
When confronted with sensorimotor interaction tasks that correspond to the
classical prisoner's dilemma and the rope-pulling game, two-player motor
interactions led predominantly to Nash solutions. In contrast, when a single
player took both roles, playing the sensorimotor game bimanually, cooperative
solutions were found. Our methodology opens up a new avenue for the study of
human motor interactions within a game theoretic framework, suggesting that
the coupling of motor systems can lead to game theoretic solutions.

C. Chang, J. P. Cunningham, and G. Glover.
**Influence of heart rate on the bold signal: the cardiac response
function**.
*NeuroImage*, 44:857-869, 2009.

** Abstract:** It has
previously been shown that low-frequency fluctuations in both respiratory
volume and cardiac rate can induce changes in the blood-oxygen level
dependent (BOLD) signal. Such physiological noise can obscure the detection
of neural activation using fMRI, and it is therefore important to model and
remove the effects of this noise. While a hemodynamic response function
relating respiratory variation (RV) and the BOLD signal has been described,
no such mapping for heart rate (HR) has been proposed. In the current study,
the effects of RV and HR are simultaneously deconvolved from resting state
fMRI. It is demonstrated that a convolution model including RV and HR can
explain significantly more variance in gray matter BOLD signal than a model
that includes RV alone, and an average HR response function is proposed that
well characterizes our subject population. It is observed that the voxel-wise
morphology of the deconvolved RV responses is preserved when HR is included
in the model, and that its form is adequately modeled by Birn et al.'s
previously described respiration response function. Furthermore, it is shown
that modeling out RV and HR can significantly alter functional connectivity
maps of the default-mode network.

W. Chu and Z. Ghahramani.
**Probabilistic models
for incomplete multi-dimensional arrays**.
In D. van Dyk and M. Welling, editors, *12th International Conference on
Artificial Intelligence and Statistics*, volume 5, pages 89-96,
Clearwater Beach, FL, USA, April 2009. Microtome Publishing (paper) Journal
of Machine Learning Research.
ISSN 1938-7228.

** Abstract:** In multiway data, each sample is
measured by multiple sets of correlated attributes. We develop a
probabilistic framework for modeling structural dependency from partially
observed multi-dimensional array data, known as pTucker. Latent components
associated with individual array dimensions are jointly retrieved while the
core tensor is integrated out. The resulting algorithm is capable of handling
large-scale data sets. We verify the usefulness of this approach by comparing
against classical models on applications to modeling amino acid fluorescence,
collaborative filtering and a number of benchmark multiway array data.

Marc Peter Deisenroth, Marco F. Huber, and Uwe D. Hanebeck.
**Analytic
moment-based Gaussian process filtering**.
In Léon Bottou and Michael Littman, editors, *26th International
Conference on Machine Learning*, pages 225-232, Montréal, QC,
Canada, June 2009. Omnipress.

** Abstract:** We propose an
analytic moment-based filter for nonlinear stochastic dynamic systems modeled
by Gaussian processes. Exact expressions for the expected value and the
covariance matrix are provided for both the prediction step and the filter
step, where an additional Gaussian assumption is exploited in the latter
case. Our filter does not require further approximations. In particular, it
avoids finite-sample approximations. We compare the filter to a variety of
Gaussian filters, that is, the EKF, the UKF, and the recent GP-UKF proposed
by Ko et al. (2007).

** Comment:** With corrections. code.

Marc Peter Deisenroth and Carl Edward Rasmussen.
**Bayesian inference
for efficient learning in control**.
In *Multidisciplinary Symposium on Reinforcement Learning*,
Montréal, QC, Canada, June 2009.

** Abstract:** In
contrast to humans or animals, artificial learners often require more trials
when learning motor control tasks solely based on experience. Efficient
autonomous learners will reduce the amount of engineering required to solve
control problems. By using probabilistic forward models, we can employ two
key ingredients of biological learning systems to speed up artificial
learning. We present a consistent and coherent Bayesian framework that allows
for efficient autonomous experience-based learning. We demonstrate the
success of our learning algorithm by applying it to challenging nonlinear
control problems in simulation and in hardware.

Marc Peter Deisenroth and Carl Edward Rasmussen.
**Efficient
reinforcement learning for motor control**.
In *10th International PhD Workshop on Systems and Control*, Hluboká
nad Vltavou, Czech Republic, September 2009.

** Abstract:**
Artificial learners often require many more trials than humans or animals
when learning motor control tasks in the absence of expert knowledge. We
implement two key ingredients of biological learning systems, generalization
and incorporation of uncertainty into the decision-making process, to speed
up artificial learning. We present a coherent and fully Bayesian framework
that allows for efficient artificial learning in the absence of expert
knowledge. The success of our learning framework is demonstrated on
challenging nonlinear control problems in simulation and in hardware.

Marc Peter Deisenroth, Carl Edward Rasmussen, and Jan Peters.
**Gaussian process
dynamic programming**.
*Neurocomputing*, 72(7-9):1508-1524, March 2009, doi
10.1016/j.neucom.2008.12.019.

** Abstract:** Reinforcement
learning (RL) and optimal control of systems with continuous states and
actions require approximation techniques in most interesting cases. In this
article, we introduce Gaussian process dynamic programming (GPDP), an
approximate value function-based RL algorithm. We consider both a classic
optimal control problem, where problem-specific prior knowledge is available,
and a classic RL problem, where only very general priors can be used. For the
classic optimal control problem, GPDP models the unknown value functions with
Gaussian processes and generalizes dynamic programming to continuous-valued
states and actions. For the RL problem, GPDP starts from a given initial
state and explores the state space using Bayesian active learning. To design
a fast learner, available data have to be used efficiently. Hence, we propose
to learn probabilistic models of the a priori unknown transition dynamics and
the value functions on the fly. In both cases, we successfully apply the
resulting continuous-valued controllers to the under-actuated pendulum swing
up and analyze the performances of the suggested algorithms. It turns out
that GPDP uses data very efficiently and can be applied to problems, where
classic dynamic programming would be cumbersome.

** Comment:** code.

Finale Doshi-Velez.
**The Indian buffet
process: Scalable inference and extensions**.
Master's thesis, University of Cambridge, Cambridge, UK, August 2009.

** Abstract:** Many unsupervised learning problems seek to
identify hidden features from observations. In many real-world situations,
the number of hidden features is unknown. To avoid specifying the number of
hidden features a priori, one can use the Indian Buffet Process (IBP): a
nonparametric latent feature model that does not bound the number of active
features in a dataset. While elegant, the lack of efficient inference
procedures for the IBP has prevented its application in large-scale problems.
The core contribution of this thesis are three new inference procedures that
allow inference in the IBP to be scaled from a few hundred to 100,000
observations. This thesis contains three parts: (1) An introduction to the
IBP and a review of inference techniques and extensions. The first chapters
summarise three constructions for the IBP and review all currently published
inference techniques. Appendix C reviews extensions of the IBP to date. (2)
Novel techniques for scalable Bayesian inference. This thesis presents three
new inference procedures: (a) an accelerated Gibbs sampler for efficient
Bayesian inference in a broad class of conjugate models, (b) a parallel,
asynchronous Gibbs sampler that allows the accelerated Gibbs sampler to be
distributed across multiple processors, and (c) a variational inference
procedure for the IBP. (3) A framework for structured nonparametric latent
feature models. We also present extensions to the IBP to model more
sophisticated relationships between the co-occurring hidden features,
providing a general framework for correlated non-parametric feature
models.

Finale Doshi-Velez.
**The infinite partially
observable Markov decision process**.
In *Advances in Neural Information Processing Systems 23*, Cambridge,
MA, USA, December 2009. The MIT Press.

** Abstract:** The
Partially Observable Markov Decision Process (POMDP) framework has proven
useful in planning domains where agents must balance actions that provide
knowledge and actions that provide reward. Unfortunately, most POMDPs are
complex structures with a large number of parameters. In many real-world
problems, both the structure and the parameters are difficult to specify from
domain knowledge alone. Recent work in Bayesian reinforcement learning has
made headway in learning POMDP models; however, this work has largely focused
on learning the parameters of the POMDP model. We define an infinite POMDP
(iPOMDP) model that does not require knowledge of the size of the state
space; instead, it assumes that the number of visited states will grow as the
agent explores its world and only models visited states explicitly. We
demonstrate the iPOMDP on several standard problems.

Finale Doshi-Velez and Zoubin Ghahramani.
**Accelerated Gibbs
sampling for the Indian buffet process**.
In Léon Bottou and Michael Littman, editors, *26th International
Conference on Machine Learning*, pages 273-280, Montréal, QC,
Canada, June 2009. Omnipress.

** Abstract:** We often seek to
identify co-occurring hidden features in a set of observations. The Indian
Buffet Process (IBP) provides a non-parametric prior on the features present
in each observation, but current inference techniques for the IBP often scale
poorly. The collapsed Gibbs sampler for the IBP has a running time cubic in
the number of observations, and the uncollapsed Gibbs sampler, while linear,
is often slow to mix. We present a new linear-time collapsed Gibbs sampler
for conjugate likelihood models and demonstrate its efficacy on large
real-world datasets.

F. Doshi-Velez and Z. Ghahramani.
**Correlated
non-parametric latent feature models**.
In *Conference on Uncertainty in Artificial Intelligence (UAI 2009)*,
pages 143-150, Montréal, QC, Canada, June 2009. AUAI Press.

** Abstract:** We are often interested in explaining data
through a set of hidden factors or features. To allow for an unknown number
of such hidden features, one can use the IBP: a non-parametric latent feature
model that does not bound the number of active features in a dataset.
However, the IBP assumes that all latent features are uncorrelated, making it
inadequate for many real-world problems. We introduce a framework for
correlated non-parametric feature models, generalising the IBP. We use this
framework to generate several specific models and demonstrate applications on
real-world datasets.

Finale Doshi-Velez, David Knowles, Shakir Mohamed, and Zoubin Ghahramani.
**Large scale
non-parametric inference: Data parallelisation in the Indian buffet
process**.
In *Advances in Neural Information Processing Systems 23*, pages
1294-1302, Cambridge, MA, USA, December 2009. The MIT Press.

** Abstract:** Nonparametric Bayesian models provide a framework
for flexible probabilistic modelling of complex datasets. Unfortunately, the
high-dimensional averages required for Bayesian methods can be slow,
especially with the unbounded representations used by nonparametric models.
We address the challenge of scaling Bayesian inference to the increasingly
large datasets found in real-world applications. We focus on parallelisation
of inference in the Indian Buffet Process (IBP), which allows data points to
have an unbounded number of sparse latent features. Our novel MCMC sampler
divides a large data set between multiple processors and uses message passing
to compute the global likelihoods and posteriors. This algorithm, the first
parallel inference scheme for IBP-based models, scales to datasets orders of
magnitude larger than have previously been possible.

F. Doshi-Velez, K.T. Miller, J. Van Gael, and Y.W. Teh.
**Variational
inference for the Indian buffet process**.
In *12th International Conference on Artificial Intelligence and
Statistics*, volume 12, pages 137-144, Clearwater Beach, FL, USA,
April 2009. Journal of Machine Learning Research.

**
Abstract:** The Indian Buffet Process (IBP) is a nonparametric prior for
latent feature models in which observations are influenced by a combination
of hidden features. For example, images may be composed of several objects
and sounds may consist of several notes. Latent feature models seek to infer
these unobserved features from a set of observations; the IBP provides a
principled prior in situations where the number of hidden features is
unknown. Current inference methods for the IBP have all relied on sampling.
While these methods are guaranteed to be accurate in the limit, samplers for
the IBP tend to mix slowly in practice. We develop a deterministic
variational method for inference in the IBP based on a truncated
stick-breaking approximation, provide theoretical bounds on the truncation
error, and evaluate our method in several data regimes.

Finale Doshi-Velez, Kurt T. Miller, Jurgen Van Gael, and Yee Whye Teh.
**Variational
inference for the Indian buffet process**.
Technical Report CBL-2009-001, University of Cambridge, Computational and
Biological Learning Laboratory, Department of Engineering, April 2009.

** Abstract:** The Indian Buffet Process (IBP) is a
nonparametric prior for latent feature models in which observations are
influenced by a combination of hidden features. For example, images may be
composed of several objects and sounds may consist of several notes. Latent
feature models seek to infer these unobserved features from a set of
observations; the IBP provides a principled prior in situations where the
number of hidden features is unknown. Current inference methods for the IBP
have all relied on sampling. While these methods are guaranteed to be
accurate in the limit, samplers for the IBP tend to mix slowly in practice.
We develop a deterministic variational method for inference in the IBP based
on truncating to infinite models, provide theoretical bounds on the
truncation error, and evaluate our method in several data regimes. This
technical report is a longer version of Doshi-Velez et al. (2009).

Frederik Eaton and Zoubin Ghahramani.
**Choosing a variable
to clamp: Approximate inference using conditioned belief
propagation**.
In D. van Dyk and M. Welling, editors, *12th International Conference on
Artificial Intelligence and Statistics*, volume 5, pages 145-152,
Clearwater Beach, FL, USA, April 2009. Journal of Machine Learning
Research.

** Abstract:** In this paper we propose an algorithm
for approximate inference on graphical models based on belief propagation
(BP). Our algorithm is an approximate version of Cutset Conditioning, in
which a subset of variables is instantiated to make the rest of the graph
singly connected. We relax the constraint of single-connectedness, and select
variables one at a time for conditioning, running belief propagation after
each selection. We consider the problem of determining the best variable to
clamp at each level of recursion, and propose a fast heuristic which applies
back-propagation to the BP updates. We demonstrate that the heuristic
performs better than selecting variables at random, and give experimental
results which show that it performs competitively with existing approximate
inference algorithms.

** Comment:** Code (in C++
based on libDAI).

David A. Knowles and Susan Holmes.
**Statistical tools
for ultra-deep pyrosequencing of fast evolving viruses**.
In *NIPS Workshop: Computational Biology*, 2009.

**
Abstract:** We aim to detect minor variant Hepatitis B viruses (HBV) in 38
pyrosequencing samples from infected individuals. Errors involved in the
amplification and ultra deep pyrosequencing (UDPS) of these samples are
characterised using HBV plasmid controls. Homopolymeric regions and quality
scores are found to be significant covariates in determining insertion and
deletion (indel) error rates, but not mismatch rates which depend on the
nucleotide transition matrix. This knowledge is used to derive two methods
for classifying genuine mutations: a hypothesis testing framework and a
mixture model. Using an approximate "ground truth" from a limiting dilution
Sanger sequencing run, these methods are shown to outperform the naive
percentage threshold approach. The possibility of early stage PCR errors
becoming significant is investigated by simulation, which underlines the
importance of the initial copy number.

** Comment:** web site

C. Lippert, O. Stegle, Z. Ghahramani, and K. Borgwardt.
**A kernel
method for unsupervised structured network inference**.
In D. van Dyk and M. Welling, editors, *12th International Conference on
Artificial Intelligence and Statistics*, volume 5, pages 368-375,
Clearwater Beach, FL, USA, April 2009. Journal of Machine Learning Research.
ISSN: 1938-7228.

** Abstract:** Network inference is the problem
of inferring edges between a set of real-world objects, for instance,
interactions between pairs of proteins in bioinformatics. Current
kernel-based approaches to this problem share a set of common features: (i)
they are supervised and hence require labeled training data; (ii) edges in
the network are treated as mutually independent and hence topological
properties are largely ignored; (iii) they lack a statistical interpretation.
We argue that these common assumptions are often undesirable for network
inference, and propose (i) an unsupervised kernel method (ii) that takes the
global structure of the network into account and (iii) is statistically
motivated. We show that our approach can explain commonly used heuristics in
statistical terms. In experiments on social networks, dfferent variants of
our method demonstrate appealing predictive performance.

Jörg Lücke, Richard E. Turner, Maneesh Sahani, and Marc Henniges.
**Occlusive components analysis**.
In Y Bengio, D Schuurmans, J Lafferty, C K I Williams, and A Culotta, editors,
*nips22*, pages 1069-1077. mit, 2009.

** Abstract:**
We study unsupervised learning in a probabilistic generative model for
occlusion. The model uses two types of latent variables: one indicates which
objects are present in the image, and the other how they are ordered in
depth. This depth order then determines how the positions and appearances of
the objects present, specified in the model parameters, combine to form the
image. We show that the object parameters can be learnt from an unlabelled
set of images in which objects occlude one another. Exact maximum-likelihood
learning is intractable. However, we show that tractable approximations to
Expectation Maximization (EM) can be found if the training images each
contain only a small number of objects on average. In numerical experiments
it is shown that these approximations recover the correct set of object
parameters. Experiments on a novel version of the bars test using colored
bars, and experiments on more realistic data, show that the algorithm
performs well in extracting the generating causes. Experiments based on the
standard bars benchmark test for object learning show that the algorithm
performs well in comparison to other recent component extraction approaches.
The model and the learning algorithm thus connect research on occlusion with
the research field of multiple-causes component extraction methods.

Shakir Mohamed, Katherine A. Heller, and Zoubin Ghahramani.
**Bayesian
exponential family PCA**.
In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, *Advances in
Neural Information Processing Systems 21*, pages 1089-1096, Cambridge,
MA, USA, December 2009. The MIT Press.

** Abstract:**
Principal Components Analysis (PCA) has become established as one of the key
tools for dimensionality reduction when dealing with real valued data.
Approaches such as exponential family PCA and non-negative matrix
factorisation have successfully extended PCA to non-Gaussian data types, but
these techniques fail to take advantage of Bayesian inference and can suffer
from problems of overfitting and poor generalisation. This paper presents a
fully probabilistic approach to PCA, which is generalised to the exponential
family, based on Hybrid Monte Carlo sampling. We describe the model which is
based on a factorisation of the observed data matrix, and show performance of
the model on both synthetic and real data.

** Comment:** spotlight.

Peter Orbanz.
**Construction of
nonparametric Bayesian models from parametric Bayes equations**.
In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta,
editors, *Advances in Neural Information Processing Systems 22*, pages
1392-1400. The MIT Press, 2009.

** Abstract:** We consider
the general problem of constructing nonparametric Bayesian models on
infinite-dimensional random objects, such as functions, infinite graphs or
infinite permutations. The problem has generated much interest in machine
learning, where it is treated heuristically, but has not been studied in full
generality in nonparametric Bayesian statistics, which tends to focus on
models over probability distributions. Our approach applies a standard tool
of stochastic process theory, the construction of stochastic processes from
their finite-dimensional marginal distributions. The main contribution of the
paper is a generalization of the classic Kolmogorov extension theorem to
conditional probabilities. This extension allows a rigorous construction of
nonparametric Bayesian models from systems of finitedimensional, parametric
Bayes equations. Using this approach, we show (i) how existence of a
conjugate posterior for the nonparametric model can be guaranteed by choosing
conjugate finite-dimensional models in the construction, (ii) how the mapping
to the posterior parameters of the nonparametric model can be explicitly
determined, and (iii) that the construction of conjugate models in essence
requires the finite-dimensional models to be in the exponential family. As an
application of our constructive framework, we derive a model on infinite
permutations, the nonparametric Bayesian analogue of a model recently
proposed for the analysis of rank data.

** Comment:** Supplements (proofs) and techreport version

Carl Edward Rasmussen, Bernhard J. de la Cruz, Zoubin Ghahramani, and David L.
Wild.
**Modeling and visualizing
uncertainty in gene expression clusters using Dirichlet process
mixtures**.
*IEEE/ACM Transactions on Computational Biology and Bioinformatics*,
6(4):615-628, 2009, doi
10.1109/TCBB.2007.70269.

** Abstract:** Although the use
of clustering methods has rapidly become one of the standard computational
approaches in the literature of microarray gene expression data, little
attention has been paid to uncertainty in the results obtained. Dirichlet
process mixture (DPM) models provide a nonparametric Bayesian alternative to
the bootstrap approach to modeling uncertainty in gene expression clustering.
Most previously published applications of Bayesian model-based clustering
methods have been to short time series data. In this paper, we present a case
study of the application of nonparametric Bayesian clustering methods to the
clustering of high-dimensional nontime series gene expression data using full
Gaussian covariances. We use the probability that two genes belong to the
same cluster in a DPM model as a measure of the similarity of these gene
expression profiles. Conversely, this probability can be used to define a
dissimilarity measure, which, for the purposes of visualization, can be input
to one of the standard linkage algorithms used for hierarchical clustering.
Biologically plausible results are obtained from the Rosetta compendium of
expression profiles which extend previously published cluster analyses of
this data.

R. Savage, K. A. Heller, Y. Xu, Zoubin Ghahramani, W. Truman, M. Grant,
K. Denby, and D. L. Wild.
**R/BHC: fast
Bayesian hierarchical clustering for microarray data**.
*BMC Bioinformatics 2009*, 10(242):1-9, August 2009, doi
10.1186/1471-2105-10-242.

** Abstract:** Background:
Although the use of clustering methods has rapidly become one of the standard
computational approaches in the literature of microarray gene expression data
analysis, little attention has been paid to uncertainty in the results
obtained.

Results: We present an R/Bioconductor port of a fast novel
algorithm for Bayesian agglomerative hierarchical clustering and demonstrate
its use in clustering gene expression microarray data. The method performs
bottom-up hierarchical clustering, using a Dirichlet Process (infinite
mixture) to model uncertainty in the data and Bayesian model selection to
decide at each step which clusters to merge.

Conclusion: Biologically
plausible results are presented from a well studied data set: expression
profiles of *A. thaliana* subjected to a variety of biotic and abiotic
stresses. Our method avoids several limitations of traditional methods, for
example how many clusters there should be and how to choose a principled
distance metric.

Mikkel N. Schmidt.
**Function factorization
using warped Gaussian processes**.
In Léon Bottou and Michael Littman, editors, *26th International
Conference on Machine Learning*, pages 921-928, Montréal, QC,
Canada, June 2009. Omnipress.

** Abstract:** We introduce a
new approach to non-linear regression called function factorization, that is
suitable for problems where an output variable can reasonably be modeled by a
number of multiplicative interaction terms between non-linear functions of
the inputs. The idea is to approximate a complicated function on a
high-dimensional space by the sum of products of simpler functions on
lower-dimensional subspaces. Function factorization can be seen as a
generalization of matrix and tensor factorization methods, in which the data
are approximated by the sum of outer products of vectors. We present a
non-parametric Bayesian approach to function factorization where the priors
over the factorizing functions are warped Gaussian processes, and we do
inference using Hamiltonian Markov chain Monte Carlo. We demonstrate the
superior predictive performance of the method on a food science data set
compared to Gaussian process regression and tensor factorization using
PARAFAC and GEMANOVA models.

** Comment:** slides.
poster. video.

Mikkel N. Schmidt.
**Linearly constrained
Bayesian matrix factorization for blind source separation**.
In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta,
editors, *Advances in Neural Information Processing Systems 22*, pages
1624-1632, Cambridge, MA, USA, December 2009. The MIT Press.

** Abstract:** We present a general Bayesian approach to
probabilistic matrix factorization subject to linear constraints. The
approach is based on a Gaussian observation model and Gaussian priors with
bilinear equality and inequality constraints. We present an efficient Markov
chain Monte Carlo inference procedure based on Gibbs sampling. Special cases
of the proposed model are Bayesian formulations of non-negative matrix
factorization and factor analysis. The method is evaluated on a blind source
separation problem. We demonstrate that our algorithm can be used to extract
meaningful and interpretable features that are remarkably different from
features extracted using existing related matrix factorization
techniques.

** Comment:** code.

Mikkel N. Schmidt and Shakir Mohamed.
**Probabilistic
non-negative tensor factorization using Markov chain Monte
Carlo**.
In *European Signal Processing Conference (EUSIPCO)*, pages 1918-1922,
Glasgow, Scotland, August 2009.

** Abstract:** We present a
probabilistic model for learning non-negative tensor factorizations (NTF), in
which the tensor factors are latent variables associated with each data
dimension. The non-negativity constraint for the latent factors is handled by
choosing priors with support on the non-negative numbers. Two Bayesian
inference procedures based on Markov chain Monte Carlo sampling are
described: Gibbs sampling and Hamiltonian Markov chain Monte Carlo. We
evaluate the model on two food science data sets, and show that the
probabilistic NTF model leads to better predictions and avoids overfitting
compared to existing NTF approaches.

** Comment:** Rated by reviewers amongst the top 5% of the
presented papers.

Mikkel N. Schmidt, Ole Winther, and Lars Kai Hansen.
**Bayesian
non-negative matrix factorization**.
In *8th International Conference on Independent Component Analysis and
Signal Separation*, volume 5441 of *Lecture Notes in Computer Science
(LNCS)*, pages 540-547, Paraty, Brazil, March 2009. Springer.

** Abstract:** We present a Bayesian treatment of non-negative
matrix factorization (NMF), based on a normal likelihood and exponential
priors, and derive an efficient Gibbs sampler to approximate the posterior
density of the NMF factors. On a chemical brain imaging data set, we show
that this improves interpretability by providing uncertainty estimates. We
discuss how the Gibbs sampler can be used for model order selection by
estimating the marginal likelihood, and compare with the Bayesian information
criterion. For computing the maximum a posteriori estimate we present an
iterated conditional modes algorithm that rivals existing state-of-the-art
NMF algorithms on an image feature extraction problem.

R. Silva and Z. Ghahramani.
**The hidden life of
latent variables: Bayesian learning with mixed graph models**.
*Journal of Machine Learning Research*, 10:1187-1238, June 2009.

** Abstract:** Directed acyclic graphs (DAGs) have been widely
used as a representation of conditional independence in machine learning and
statistics. Moreover, hidden or latent variables are often an important
component of graphical models. However, DAG models suffer from an important
limitation: the family of DAGs is not closed under marginalization of hidden
variables. This means that in general we cannot use a DAG to represent the
independencies over a subset of variables in a larger DAG. Directed mixed
graphs (DMGs) are a representation that includes DAGs as a special case, and
overcomes this limitation. This paper introduces algorithms for performing
Bayesian inference in Gaussian and probit DMG models. An important
requirement for inference is the specification of the distribution over
parameters of the models. We introduce a new distribution for covariance
matrices of Gaussian DMGs. We discuss and illustrate how several Bayesian
machine learning tasks can benefit from the principle presented here: the
power to model dependencies that are generated from hidden variables, but
without necessarily modeling such variables explicitly.

R. Silva and Z. Ghahramani.
**Factorial mixture
of Gaussians and the marginal independence model**.
In *12th International Conference on Artificial Intelligence and
Statistics*, volume 5, pages 520-527, Clearwater Beach, FL, USA,
April 2009. Journal of Machine Learning Research.
ISSN: 1938-7228.

** Abstract:** Marginal independence
constraints play an important role in learning with graphical models. One way
of parameterizing a model of marginal independencies is by building a latent
variable model where two independent observed variables have no common latent
source. In sparse domains, however, it might be advantageous to model the
marginal observed distribution directly, without explicitly including latent
variables in the model. There have been recent advances in Gaussian and
binary models of marginal independence, but no models with non-linear
dependencies between continuous variables has been proposed so far. In this
paper, we describe how to generalize the Gaussian model of marginal
independencies based on mixtures, and how to learn parameters. This requires
a non-standard parameterization and raises difficult non-linear optimization
issues.

** Comment:** Code at http://www.homepages.ucl.ac.uk/~ucgtrbd/code/fmog-version0.zip

O. Stegle, K. Denby, S. McHattie, A. Meade, D. Wild, Z. Ghahramani, and
K Borgwardt.
**Discovering
temporal patterns of differential gene expression in microarray time
series**.
In *German Conference on Bioinformatics*, pages 133-142, Halle,
Germany, September 2009.

** Abstract:** A wealth of time
series of microarray measurements have become available over recent years.
Several two-sample tests for detecting differential gene expression in these
time series have been defined, but they can only answer the question
*whether* a gene is differentially expressed across the whole time
series, not *in which intervals* it is differentially expressed. In this
article, we propose a Gaussian process based approach for studying these
dynamics of differential gene expression. In experiments on *Arabidopsis
thaliana* gene expression levels, our novel technique helps us to uncover
that the family of WRKY transcription factors appears to be involved in the
early response to infection by a fungal pathogen.

O. Stegle, K. Denby, David L. Wild, Zoubin Ghahramani, and Karsten Borgwardt.
**A robust
Bayesian two-sample test for detecting intervals of differential gene
expression in microarray time series**.
In *13th Annual International Conference on Research in Computational
Molecular Biology (RECOMB 2009)*, volume 5541 of *Lecture Notes in
Bioinformatics*, pages 201-216, Tucson, AZ, USA, 2009. Springer-Verlag,
doi
10.1007/978-3-642-02008-7_14.

** Abstract:** Understanding
the regulatory mechanisms that are responsible for an organism's response to
environmental changes is an important question in molecular biology. A first
and important step towards this goal is to detect genes whose expression
levels are affected by altered external conditions. A range of methods to
test for differential gene expression, both in static as well as in
time-course experiments, have been proposed. While these tests answer the
question *whether* a gene is differentially expressed, they do not
explicitly address the question *when* a gene is differentially
expressed, although this information may provide insights into the course and
causal structure of regulatory programs. In this article, we propose a
two-sample test for identifying *intervals* of differential gene
expression in microarray time series. Our approach is based on Gaussian
process regression, can deal with arbitrary numbers of replicates and is
robust with respect to outliers. We apply our algorithm to study the response
of *Arabidopsis thaliana* genes to an infection by a fungal pathogen
using a microarray time series dataset covering 30,336 gene probes at 24 time
points. In classification experiments our test compares favorably with
existing methods and provides additional insights into time-dependent
differential expression.

T. Stepleton, Z. Ghahramani, G. Gordon, and T.-S. Lee.
**The block
diagonal infinite hidden Markov model**.
In D. van Dyk and M. Welling, editors, *12th International Conference on
Artificial Intelligence and Statistics*, volume 5, pages 552-559,
Clearwater Beach, FL, USA, April 2009. Microtome Publishing (paper) Journal
of Machine Learning Research.
ISSN 1938-7228.

** Abstract:** The Infinite Hidden Markov Model
(IHMM) extends hidden Markov models to have a countably infinite number of
hidden states (Beal et al., 2002; Teh et al.,
2006). We present a generalization of this framework that introduces nearly
block-diagonal structure in the transitions between the hidden states, where
blocks correspond to "sub-behaviors" exhibited by data sequences. In
identifying such structure, the model classifies, or partitions, sequence
data according to these sub-behaviors in an unsupervised way. We present an
application of this model to artificial data, a video gesture classification
task, and a musical theme labeling task, and show that components of the
model can also be applied to graph segmentation.

Ryan Turner, Marc Peter Deisenroth, and Carl Edward Rasmussen.
**System
identification in Gaussian process dynamical systems**.
In Dilan Görür, editor, *NIPS Workshop on Nonparametric Bayes*,
Whistler, BC, Canada, December 2009.

** Comment:** poster.

Ryan Turner, Yunus Saatçi, and Carl Edward Rasmussen.
**Adaptive
sequential Bayesian change point detection**.
In Zaïd Harchaoui, editor, *NIPS Workshop on Temporal Segmentation*,
Whistler, BC, Canada, December 2009.

** Abstract:** Real-world
time series are often nonstationary with respect to the parameters of some
underlying prediction model (UPM). Furthermore, it is often desirable to
adapt the UPM to incoming regime changes as soon as possible, necessitating
sequential inference about change point locations. A Bayesian algorithm for
online change point detection (BOCPD) has been introduced recently by Adams
and MacKay (2007). In this algorithm, uncertainty about the last change point
location is updated sequentially, and is integrated out to make online
predictions robust to parameter changes. BOCPD requires a set of fixed
hyper-parameters which allow the user to fully specify the hazard function
for change points and the prior distribution over the parameters of the UPM.
In practice, finding the "right" hyper-parameters can be quite difficult. We
therefore extend BOCPD by introducing hyper-parameter learning, without
sacrificing the online nature of the algorithm. Hyper-parameter learning is
performed by optimizing the marginal likelihood of the BOCPD model, a
closed-form quantity which can be computed sequentially. We illustrate
performance on three real-world datasets.

J. Van Gael, A. Vlachos, and Z. Ghahramani.
**The infinite
HMM for unsupervised PoS tagging**.
In *Proceedings of the 2009 Conference on Empirical Methods in Natural
Language Processing (EMNLP)*, pages 678-687, Singapore, August 2009.
Association for Computational Linguistics.

** Abstract:** We
extend previous work on fully unsupervised part-of-speech tagging. Using a
non-parametric version of the HMM, called the infinite HMM (iHMM), we address
the problem of choosing the number of hidden states in unsupervised Markov
models for PoS tagging. We experiment with two non-parametric priors, the
Dirichlet and Pitman-Yor processes, on the Wall Street Journal dataset using
a parallelized implementation of an iHMM inference algorithm. We evaluate the
results with a variety of clustering evaluation metrics and achieve
equivalent or better performances than previously reported. Building on this
promising result we evaluate the output of the unsupervised PoS tagger as a
direct replacement for the output of a fully supervised PoS tagger for the
task of shallow parsing and compare the two evaluations.

A. Vlachos, A Korhonen, and Z. Ghahramani.
**Unsupervised and
constrained Dirichlet process mixture models for verb clustering**.
In *4th Workshop on Statistical Machine Translation, EACL '09*, Athens,
Greece, March 2009.

** Abstract:** In this work, we apply
Dirichlet Process Mixture Models (DPMMs) to a learning task in natural
language processing (NLP): lexical-semantic verb clustering. We thoroughly
evaluate a method of guiding DPMMs towards a particular clustering solution
using pairwise constraints. The quantitative and qualitative evaluation
performed highlights the benefits of both standard and constrained DPMMs
compared to previously used approaches. In addition, it sheds light on the
use of evaluation measures and their practical application.

Yang Xu, Katherine A. Heller, and Zoubin Ghahramani.
**Tree-based
inference for Dirichlet process mixtures**.
In D. van Dyk and M. Welling, editors, *12th International Conference on
Artificial Intelligence and Statistics*, volume 5, pages 623-630,
Clearwater Beach, FL, USA, April 2009. Microtome Publishing (paper), Journal
of Machine Learning Research (online).
ISSN 1938-7228.

** Abstract:** The Dirichlet process mixture
(DPM) is a widely used model for clustering and for general nonparametric
Bayesian density estimation. Unfortunately, like in many statistical models,
exact inference in a DPM is intractable, and approximate methods are needed
to perform efficient inference. While most attention in the literature has
been placed on Markov chain Monte Carlo (MCMC) [1, 2, 3], variational
Bayesian (VB) [4] and collapsed variational methods [5], [6] recently
introduced a novel class of approximation for DPMs based on Bayesian
hierarchical clustering (BHC). These tree-based combinatorial approximations
efficiently sum over exponentially many ways of partitioning the data and
offer a novel lower bound on the marginal likelihood of the DPM [6]. In this
paper we make the following contributions: (1) We show empirically that the
BHC lower bounds are substantially tighter than the bounds given by VB [4]
and by collapsed variational methods [5] on synthetic and real datasets. (2)
We also show that BHC offers a more accurate predictive performance on these
datasets. (3) We further improve the tree-based lower bounds with an
algorithm that efficiently sums contributions from alternative trees. (4) We
present a fast approximate method for BHC. Our results suggest that our
combinatorial approximate inference methods and lower bounds may be useful
not only in DPMs but in other models as well.

B. M. Yu, J. P. Cunningham, G. Santhanam, S. I. Ryu, K. V. Shenoy, and
M. Sahani.
**Gaussian-process factor analysis for low-dimensional single-trial analysis of neural
population activity**.
*Journal of Neurophysiology*, 102:614-635, 2009.

**
Abstract:** We consider the problem of extracting smooth, low-dimensional
neural trajectories that summarize the activity recorded simultaneously from
many neurons on individual experimental trials. Beyond the benefit of
visualizing the high-dimensional, noisy spiking activity in a compact form,
such trajectories can offer insight into the dynamics of the neural circuitry
underlying the recorded activity. Current methods for extracting neural
trajectories involve a two-stage process: the spike trains are first smoothed
over time, then a static dimensionality- reduction technique is applied. We
first describe extensions of the two-stage methods that allow the degree of
smoothing to be chosen in a principled way and that account for spiking
variability, which may vary both across neurons and across time. We then
present a novel method for extracting neural trajectories - Gaussian-process
factor analysis (GPFA) - which unifies the smoothing and dimensionality-
reduction operations in a common probabilistic framework. We applied these
methods to the activity of 61 neurons recorded simultaneously in macaque
premotor and motor cortices during reach planning and execution. By adopting
a goodness-of-fit metric that measures how well the activity of each neuron
can be predicted by all other recorded neurons, we found that the proposed
extensions improved the predictive ability of the two-stage methods. The
predictive ability was further improved by going to GPFA. From the extracted
trajectories, we directly observed a convergence in neural state during motor
planning, an effect that was shown indirectly by previous studies. We then
show how such methods can be a powerful tool for relating the spiking
activity across a neural population to the subject's behavior on a
single-trial basis. Finally, to assess how well the proposed methods
characterize neural population activity when the underlying time course is
known, we performed simulations that revealed that GPFA performed tens of
percent better than the best two-stage method.

B. M. Yu, J. P. Cunningham, G. Santhanam, S. I. Ryu, K. V. Shenoy, and
M. Sahani.
**Gaussian-process factor analysis for low-dimensional single-trial analysis of neural
population activity**.
In *Advances in Neural Information Processing Systems 21*, pages 1-8,
Vancouver, BC, December 2009.

** Abstract:** We consider the
problem of extracting smooth, low-dimensional neural trajectories that
summarize the activity recorded simultaneously from many neurons on
individual experimental trials. Beyond the benefit of visualizing the
high-dimensional, noisy spiking activity in a compact form, such trajectories
can offer insight into the dynamics of the neural circuitry underlying the
recorded activity. Current methods for extracting neural trajectories involve
a two-stage process: the spike trains are first smoothed over time, then a
static dimensionality- reduction technique is applied. We first describe
extensions of the two-stage methods that allow the degree of smoothing to be
chosen in a principled way and that account for spiking variability, which
may vary both across neurons and across time. We then present a novel method
for extracting neural trajectories - Gaussian-process factor analysis (GPFA)
- which unifies the smoothing and dimensionality- reduction operations in a
common probabilistic framework. We applied these methods to the activity of
61 neurons recorded simultaneously in macaque premotor and motor cortices
during reach planning and execution. By adopting a goodness-of-fit metric
that measures how well the activity of each neuron can be predicted by all
other recorded neurons, we found that the proposed extensions improved the
predictive ability of the two-stage methods. The predictive ability was
further improved by going to GPFA. From the extracted trajectories, we
directly observed a convergence in neural state during motor planning, an
effect that was shown indirectly by previous studies. We then show how such
methods can be a powerful tool for relating the spiking activity across a
neural population to the subject's behavior on a single-trial basis. Finally,
to assess how well the proposed methods characterize neural population
activity when the underlying time course is known, we performed simulations
that revealed that GPFA performed tens of percent better than the best
two-stage method.

Pietro Berkes, Richard E. Turner, and Maneesh Sahani.
**On sparsity and overcompleteness in image models**.
In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, *nips20*,
volume 20. mit, 2008.

** Abstract:** Computational models
of visual cortex, and in particular those based on sparse coding, have
enjoyed much recent attention. Despite this currency, the question of how
sparse or how over-complete a sparse representation should be, has gone
without principled answer. Here, we use Bayesian model-selection methods to
address these questions for a sparse-coding model based on a Student-t prior.
Having validated our methods on toy data, we find that natural images are
indeed best modelled by extremely sparse distributions; although for the
Student-t prior, the associated optimal basis size is only modestly
over-complete.

J. P. Cunningham.
**Derivation of Expectation Propagation for "fast Gaussian process methods for
point process intensity estimation"**.
Technical report, Stanford University, 2008.

** Abstract:** We
derive the Expectation Propagation algorithm updates for approximating the
posterior distribution on intensity in a conditionally inhomogeneous gamma
interval process with a Gaussian Process prior (GP IGIP), a model which
appeared in Cunningham, Shenoy, Sahani (2008) ICML.

J. P. Cunningham, K. V. Shenoy, and M. Sahani.
**Fast
Gaussian process methods for point process intensity estimation**.
In *25th International Conference on Machine Learning*, pages 1-8,
Helsinki, Finland, June 2008.

** Abstract:** Point processes
are difficult to analyze because they provide only a sparse and noisy
observation of the intensity function driving the process. Gaussian Processes
offer an attractive framework within which to infer underlying intensity
functions. The result of this inference is a continuous function defined
across time that is typically more amenable to analytical efforts. However, a
naive implementation will become computationally infeasible in any problem of
reasonable size, both in memory and run time requirements. We demonstrate
problem specific methods for a class of renewal processes that eliminate the
memory burden and reduce the solve time by orders of magnitude.

J. P. Cunningham, B. M. Yu, K. V. Shenoy, and M. Sahani.
**Inferring neural firing rates from spike trains using Gaussian
processes**.
In *Advances in Neural Information Processing Systems 20*, pages 1-8,
Vancouver, BC, December 2008.

** Abstract:** Neural spike
trains present challenges to analytical efforts due to their noisy, spiking
nature. Many studies of neuroscientific and neural prosthetic importance rely
on a smoothed, denoised estimate of the spike train's underlying firing rate.
Current techniques to find time-varying firing rates require ad hoc choices
of parameters, offer no confidence intervals on their estimates, and can
obscure potentially important single trial variability. We present a new
method, based on a Gaussian Process prior, for inferring probabilistically
optimal estimates of firing rate functions underlying single or multiple
neural spike trains. We test the performance of the method on simulated data
and experimentally gathered neural spike trains, and we demonstrate
improvements over conventional estimators.

** Comment:** Spotlight Presentation

Marc Peter Deisenroth, Jan Peters, and Carl Edward Rasmussen.
**Approximate
dynamic programming with Gaussian processes**.
In *2008 American Control Conference (ACC 2008)*, pages 4480-4485,
Seattle, WA, USA, June 2008.

** Abstract:** In general, it is
difficult to determine an optimal closed-loop policy in nonlinear control
problems with continuous-valued state and control domains. Hence,
approximations are often inevitable. The standard method of discretizing
states and controls suffers from the curse of dimensionality and strongly
depends on the chosen temporal sampling rate. The paper introduces Gaussian
Process Dynamic Programming (GPDP). In GPDP, value functions in the Bellman
recursion of the dynamic programming algorithm are modeled using Gaussian
processes. GPDP returns an optimal state-feedback for a finite set of states.
Based on these outcomes, we learn a possibly discontinuous closed-loop policy
on the entire state space by switching between two independently trained
Gaussian processes.

** Comment:** code.

Marc Peter Deisenroth, Carl Edward Rasmussen, and Jan Peters.
**Model-based
reinforcement learning with continuous states and actions**.
In *Proceedings of the 16th European Symposium on Artificial Neural Networks
(ESANN 2008)*, pages 19-24, Bruges, Belgium, April 2008.

**
Abstract:** Finding an optimal policy in a reinforcement learning (RL)
framework with continuous state and action spaces is challenging. Approximate
solutions are often inevitable. GPDP is an approximate dynamic programming
algorithm based on Gaussian process (GP) models for the value functions. In
this paper, we extend GPDP to the case of unknown transition dynamics. After
building a GP model for the transition dynamics, we apply GPDP to this model
and determine a continuous-valued policy in the entire state space. We apply
the resulting controller to the underpowered pendulum swing up. Moreover, we
compare our results on this RL task to a nearly optimal discrete DP solution
in a fully known environment.

Finale Doshi and Nicholas Roy.
**Spoken language
interaction with model uncertainty: An adaptive human-robot interaction
system**.
*Connection Science*, 20(4):290-318, December 2008.

**
Abstract:** Spoken language is one of the most intuitive forms of
interaction between humans and agents. Unfortunately, agents that interact
with people using natural language often experience communication errors and
do not correctly understand the user's intentions. Recent systems have
successfully used probabilistic models of speech, language, and user behavior
to generate robust dialog performance in the presence of noisy speech
recognition and ambiguous language choices, but decisions made using these
probabilistic models are still prone to errors due to the complexity of
acquiring and maintaining a complete model of human language and behavior. In
this paper, we describe a decision-theoretic model for human-robot
interaction using natural language. Our algorithm is based on the Partially
Observable Markov Decision Process (POMDP), which allows agents to choose
actions that are robust not only to uncertainty from noisy or ambiguous
speech recognition but also unknown user models. Like most dialog systems, a
POMDP is defined by a large number of parameters that may be difficult to
specify a priori from domain knowledge, and learning these parameters from
the user may require an unacceptably long training period. We describe an
extension to the POMDP model that allows the agent to acquire a linguistic
model of the user online, including new vocabulary and word choice
preferences. Our approach not only avoids a training period of constant
questioning as the agent learns, but also allows the agent to actively query
for additional information when its uncertainty suggests a high risk of
mistakes. We demonstrate our approach both in simulation and on a natural
language interaction system for a robotic wheelchair application.

Richard J. Gibbens and Yunus Saatçi.
**Data, modelling and inference in road traffic networks**.
*Philosophical Transactions of the Royal Society A: Mathematical, Physical
and Engineering Sciences*, 366(1872):1907-1919, June 2008, doi
10.1098/rsta.2008.0020.

** Abstract:** In this paper, we
study UK road traffic data and explore a range of modelling and inference
questions that arise from them. For example, loop detectors on the M25
motorway record speed and flow measurements at regularly spaced locations as
well as the entry and exit lanes of junctions. An exploratory study of these
data helps us to better understand and quantify the nature of congestion on
the road network. From a traveller's perspective it is crucially important to
understand the overall journey times and we look at methods to improve our
ability to predict journey times given access jointly to both real-time and
historical loop detector data. Throughout this paper we will comment on
related work derived from US freeway data.

Katherine A. Heller, Sinead Williamson, and Zoubin Ghahramani.
**Statistical
models for partial membership**.
In Andrew McCallum and Sam Roweis, editors, *25th International Conference
on Machine Learning*, pages 392-399, Helsinki, Finland, July 2008.
Omnipress.

** Abstract:** We present a principled Bayesian
framework for modeling partial memberships of data points to clusters. Unlike
a standard mixture model which assumes that each data point belongs to one
and only one mixture component, or cluster, a partial membership model allows
data points to have fractional membership in multiple clusters. Algorithms
which assign data points partial memberships to clusters can be useful for
tasks such as clustering genes based on microarray data (Gasch & Eisen,
2002). Our Bayesian Partial Membership Model (BPM) uses exponential family
distributions to model each cluster, and a product of these distibtutions,
with weighted parameters, to model each datapoint. Here the weights
correspond to the degree to which the datapoint belongs to each cluster. All
parameters in the BPM are continuous, so we can use Hybrid Monte Carlo to
perform inference and learning. We discuss relationships between the BPM and
Latent Dirichlet Allocation, Mixed Membership models, Exponential Family PCA,
and fuzzy clustering. Lastly, we show some experimental results and discuss
nonparametric extensions to our model.

C. Hübler, K. Borgwardt, H.-P. Kriegel, and Z. Ghahramani.
**Metropolis
algorithms for representative subgraph sampling**.
In *Proceedings of 8th IEEE International Conference on Data Mining (ICDM
2008)*, pages 283-292, Pisa, Italy, December 2008. IEEE.
ISSN: 1550-4786.

** Abstract:** While data mining in
chemoinformatics studied graph data with dozens of nodes, systems biology and
the Internet are now generating graph data with thousands and millions of
nodes. Hence data mining faces the algorithmic challenge of coping with this
significant increase in graph size: Classic algorithms for data analysis are
often too expensive and too slow on large graphs.

While one strategy to
overcome this problem is to design novel efficient algorithms, the other is
to 'reduce' the size of the large graph by sampling. This is the scope of
this paper: We will present novel Metropolis algorithms for sampling a
'representative' small subgraph from the original large graph, with
'representative' describing the requirement that the sample shall preserve
crucial graph properties of the original graph. In our experiments, we
improve over the pioneering work of Leskovec and Faloutsos (KDD 2006), by
producing representative subgraph samples that are both smaller and of higher
quality than those produced by other methods from the literature.

H. Kim and Zoubin Ghahramani.
**Outlier robust
Gaussian process classification**.
In L. Niels da Vitoria, editor, *Structural, Syntactic and Statistical
Pattern Recognition*, volume 5342 of *Lecture Notes in Computer
Science (LNCS)*, pages 896-905, Berlin, Germany, December 2008. Springer
Berlin / Heidelberg.

** Abstract:** Gaussian process
classifiers (GPCs) are a fully statistical model for kernel classification.
We present a form of GPC which is robust to labeling errors in the data set.
This model allows label noise not only near the class boundaries, but also
far from the class boundaries which can result from mistakes in labelling or
gross errors in measuring the input features. We derive an outlier robust
algorithm for training this model which alternates iterations based on the EP
approximation and hyperparameter updates until convergence. We show the
usefulness of the proposed algorithm with model selection method through
simulation results.

Hannes Nickisch and Carl Edward Rasmussen.
**Approximations for binary Gaussian process classification**.
*Journal of Machine Learning Research*, 9:2035-2078, October 2008.

** Abstract:** We provide a comprehensive overview of many
recent algorithms for approximate inference in Gaussian process models for
probabilistic binary classification. The relationships between several
approaches are elucidated theoretically, and the properties of the different
algorithms are corroborated by experimental results. We examine both 1) the
quality of the predictive distributions and 2) the suitability of the
different marginal likelihood approximations for model selection (selecting
hyperparameters) and compare to a gold standard based on MCMC. Interestingly,
some methods produce good predictive distributions although their marginal
likelihood approximations are poor. Strong conclusions are drawn about the
methods: The Expectation Propagation algorithm is almost always the method of
choice unless the computational budget is very tight. We also extend existing
methods in various ways, and provide unifying code implementing all
approaches.

Carl Edward Rasmussen and Marc Peter Deisenroth.
**Probabilistic
inference for fast learning in control**.
In S. Girgin, M. Loth, R. Munos, P. Preux, and D. Ryabko, editors, *Recent
Advances in Reinforcement Learning*, volume 5323 of *Lecture Notes in
Computer Science (LNCS)*, pages 229-242, Villeneuve d'Ascq, France,
November 2008. Springer-Verlag.

** Abstract:** We provide a
novel framework for very fast model-based reinforcement learning in
continuous state and action spaces. The framework requires probabilistic
models that explicitly characterize their levels of confidence. Within this
framework, we use flexible, non-parametric models to describe the world based
on previously collected experience. We demonstrate learning on the cart-pole
problem in a setting where we provide very limited prior knowledge about the
task. Learning progresses rapidly, and a good policy is found after only a
hand-full of iterations.

** Comment:** videos and more. slides.

R. Silva, W. Chu, and Zoubin Ghahramani.
**Hidden common
cause relations in relational learning**.
In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, *Advances in
Neural Information Processing Systems 20*, pages 1345-1352, Cambridge,
MA, USA, December 2008. The MIT Press.

** Abstract:** When
predicting class labels for objects within a relational database, it is often
helpful to consider a model for relationships: this allows for information
between class labels to be shared and to improve prediction performance.
However, there are different ways by which objects can be related within a
relational database. One traditional way corresponds to a Markov network
structure: each existing relation is represented by an undirected edge. This
encodes that, conditioned on input features, each object label is independent
of other object labels given its neighbors in the graph. However, there is no
reason why Markov networks should be the only representation of choice for
symmetric dependence structures. Here we discuss the case when relationships
are postulated to exist due to *hidden common causes*. We discuss how
the resulting graphical model differs from Markov networks, and how it
describes different types of real-world relational processes. A Bayesian
nonparametric classification model is built upon this graphical
representation and evaluated with several empirical studies.

** Comment:** Code at http://www.homepages.ucl.ac.uk/~ucgtrbd/code/xgp

J.M. Sung, Z. Ghahramani, and S.Y. Bang.
**Latent space
variational Bayes**.
*IEEE Transactions on Pattern Analysis and Machine Intelligence*,
30(12):2236-2242, November 2008.

** Abstract:** Variational
Bayesian Expectation-Maximization (VBEM), an approximate inference method for
probabilistic models based on factorizing over latent variables and model
parameters, has been a standard technique for practical Bayesian inference.
In this paper, we introduce a more general approximate inference framework
for conjugate-exponential family models, which we call Latent-Space
Variational Bayes (LSVB). In this approach, we integrate out the model
parameters in an exact way, leaving only the latent variables. It can be
shown that the LSVB approach gives better estimates of the model evidence as
well as the distribution over the latent variables than the VBEM approach,
but, in practice, the distribution over the latent variables has to be
approximated. As a practical implementation, we present a First-order LSVB
(FoLSVB) algorithm to approximate the distribution over the latent variables.
From this approximate distribution, one can also estimate the model evidence
and the posterior over the model parameters. The FoLSVB algorithm is directly
comparable to the VBEM algorithm and has the same computational complexity.
We discuss how LSVB generalizes the recently proposed collapsed variational
methods to general conjugate-exponential families. Examples based on mixtures
of Gaussians and mixtures of Bernoullis with synthetic and real-world data
sets are used to illustrate some advantages of our method over VBEM.

J.M. Sung, Z. Ghahramani, and S.Y. Bang.
**Second-order
latent space variational Bayes for approximate Bayesian
inference**.
*IEEE Signal Processing Letters*, 15:918-921, December 2008.

** Abstract:** In this letter, we consider a variational
approximate Bayesian inference framework, latent-space variational Bayes
(LSVB), in the general context of conjugate-exponential family models with
latent variables. In the LSVB approach, we integrate out model parameters in
an exact way and then perform the variational inference over only the latent
variables. It can be shown that LSVB can achieve better estimates of the
model evidence as well as the distribution over the latent variables than the
popular variational Bayesian expectation-maximization (VBEM). However, the
distribution over the latent variables in LSVB has to be approximated in
practice. As an approximate implementation of LSVB, we propose a second-order
LSVB (SoLSVB) method. In particular, VBEM can be derived as a special case of
a first-order approximation in LSVB. SoLSVB can capture higher order
statistics neglected in VBEM and can therefore achieve a better
approximation. Examples of Gaussian mixture models are used to illustrate the
comparison between our method and VBEM, demonstrating the improvement.

Richard E. Turner and Maneesh Sahani.
**Modeling natural sounds with modulation cascade processes**.
In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, *nips20*,
volume 20. mit, 2008.

** Abstract:** Natural sounds are
structured on many time-scales. A typical segment of speech, for example,
contains features that span four orders of magnitude: Sentences ($\sim1$s);
phonemes ($\sim10$−$1$ s); glottal pulses ($\sim 10$−$2$s); and formants
($\sim 10$−$3$s). The auditory system uses information from each of these
time-scales to solve complicated tasks such as auditory scene analysis [1].
One route toward understanding how auditory processing accomplishes this
analysis is to build neuroscience-inspired algorithms which solve similar
tasks and to compare the properties of these algorithms with properties of
auditory processing. There is however a discord: Current machine-audition
algorithms largely concentrate on the shorter time-scale structures in
sounds, and the longer structures are ignored. The reason for this is
two-fold. Firstly, it is a difficult technical problem to construct an
algorithm that utilises both sorts of information. Secondly, it is
computationally demanding to simultaneously process data both at high
resolution (to extract short temporal information) and for long duration (to
extract long temporal information). The contribution of this work is to
develop a new statistical model for natural sounds that captures structure
across a wide range of time-scales, and to provide efficient learning and
inference algorithms. We demonstrate the success of this approach on a
missing data task.

J. Van Gael, Y.W. Teh, and Z. Ghahramani.
**The infinite
factorial hidden Markov model**.
In D. Koller, D. Schuurmans, L. Bottou, and Y. Bengio, editors, *Advances in
Neural Information Processing Systems 21*, volume 21, pages
1697-1704, Cambridge, MA, USA, December 2008. The MIT Press.

** Abstract:** The infinite factorial hidden Markov model is a
non-parametric extension of the factorial hidden Markov model. Our model
defines a probability distribution over an infinite number of independent
binary hidden Markov chains which together produce an observable sequence of
random variables. Central to our model is a new type of non-parametric prior
distribution inspired by the Indian Buffet Process which we call the
*Indian Buffet Markov Process*.

A. Vlachos, Z. Ghahramani, and A Korhonen.
**Dirichlet
process mixture models for verb clustering**.
In Guillaume Bouchard, Hal Daumé III, Marc Dymetman, and Yee Whye Teh,
editors, *ICML Workshop on Prior Knowledge for Text and Language
Processing*, pages 43-48, Helsinki, Finland, July 2008.

**
Abstract:** In this work we apply Dirichlet Process Mixture Models to a
learning task in natural language processing (NLP): lexical-semantic verb
clustering. We assess the performance on a dataset based on Levin's (1993)
verb classes using the recently introduced V-measure metric. In, we present a
method to add human supervision to the model in order to to influence the
solution with respect to some prior knowledge. The quantitative evaluation
performed highlights the benefits of the chosen method compared to previously
used clustering approaches.

Sinead Williamson and Zoubin Ghahramani.
**Probabilistic models
for data combination in recommender systems**.
In *Learning from Multiple Sources Workshop, NIPS Conference*, Whistler
Canada, 2008.

J. Zhang, Z. Ghahramani, and Y. Yang.
**Flexible latent
variable models for multi-task learning**.
*Machine Learning*, 73(3):221-242, December 2008.

**
Abstract:** Given multiple prediction problems such as regression and
classification, we are interested in a joint inference framework which can
effectively borrow information among tasks to improve the prediction
accuracy, especially when the number of training examples per problem is
small. In this paper we propose a probabilistic framework which can support a
set of latent variable models for different multi-task learning scenarios. We
show that the framework is a generalization of standard learning methods for
single prediction problems and it can effectively model the shared structure
among different prediction tasks. Furthermore, we present efficient
algorithms for the empirical Bayes method as well as point estimation. Our
experiments on both simulated datasets and real world classification datasets
show the effectiveness of the proposed models in two evaluation settings:
standard multi-task learning setting and transfer learning setting.

Arik Azran.
**The Rendezvous
algorithm: multiclass semi-supervised learning with Markov random
walks**.
In Zoubin Ghahramani, editor, *24th International Conference on Machine
Learning*, pages 49-56, Corvallis, OR, USA, June 2007. Omnipress.

** Abstract:** We consider the problem of multiclass
classification where both labeled and unlabeled data points are given. We
introduce and demonstrate a new approach for estimating a distribution over
the missing labels where data points are viewed as nodes of a graph, and
pairwise similarities are used to derive a transition probability matrix P
for a Markov random walk between them. The algorithm associates each point
with a particle which moves between points according to P. Labeled points are
set to be absorbing states of the Markov random walk, and the probability of
each particle to be absorbed by the different labeled points, as the number
of steps increases, is then used to derive a distribution over the associated
missing label. A computationally efficient algorithm to implement this is
derived and demonstrated on both real and artificial data sets, including a
numerical comparison with other methods.

W. Chu, V. Sindhwani, Z. Ghahramani, and S. Keerthi.
**Relational
learning with Gaussian processes**.
In B. Schölkopf, J. Platt, and T. Hofmann, editors, *Advances in Neural
Information Processing Systems 19*, volume 19 of *Bradford
Books*, pages 289-296, Cambridge, MA, USA, September 2007. The MIT
Press.
Online contents gives pages 314-321, and 289-296 on pdf of contents.

** Abstract:** Correlation between instances is often modelled
via a kernel function using input attributes of the instances. Relational
knowledge can further reveal additional pairwise correlations between
variables of interest. In this paper, we develop a class of models which
incorporates both reciprocal relational information and input attributes
using Gaussian process techniques. This approach provides a novel
non-parametric Bayesian framework with a data-dependent prior for supervised
learning tasks. We also apply this framework to semi-supervised learning.
Experimental results on several real world data sets verify the usefulness of
this algorithm.

Z. Ghahramani, T.L. Griffiths, and P. Sollich.
**Bayesian
nonparametric latent feature models (with discussion)**.
In J.M. Bernardo, M.J. Bayarri, J.O. Berger, A.P. Dawid, D. Heckerman, A.F.M.
Smith, and M. West, editors, *Bayesian Statistics 8*, pages 201-226,
Oxford, UK, July 2007. Oxford University Press.

** Abstract:**
We describe a flexible nonparametric approach to latent variable modelling in
which the number of latent variables is unbounded. This approach is based on
a probability distribution over equivalence classes of binary matrices with a
finite number of rows, corresponding to the data points, and an unbounded
number of columns, corresponding to the latent variables. Each data point can
be associated with a subset of the possible latent variables, which we refer
to as the latent features of that data point. The binary variables in the
matrix indicate which latent feature is possessed by which data point, and
there is a potentially infinite array of features. We derive the distribution
over unbounded binary matrices by taking the limit of a distribution over
N×K binary matrices as K→∞. We define a simple generative
processes for this distribution which we call the Indian buffet process (IBP;
Griffiths and Ghahramani, 2005, 2006) by analogy
to the Chinese restaurant process (Aldous, 1985; Pitman, 2002). The IBP has a
single hyperparameter which controls both the number of feature per ob ject
and the total number of features. We describe a two-parameter generalization
of the IBP which has additional flexibility, independently controlling the
number of features per object and the total number of features in the matrix.
The use of this distribution as a prior in an infinite latent feature model
is illustrated, and Markov chain Monte Carlo algorithms for inference are
described.

** Comment:** Includes discussion by David Dunson, and
rejoinder.

A.B. Goldberg, X. Zhu, J. Van Gael, and D. Andrzejewski.
**Improving
diversity in ranking using absorbing random walks**.
In *Proceedings of NAACL HLT*, pages 97-104, Rochester, NY, USA, April
2007.

David Knowles and Zoubin Ghahramani.
**Infinite sparse
factor analysis and infinite independent components analysis**.
In *7th International Conference on Independent Component Analysis and
Signal Separation*, pages 381-388, London, UK, September 2007. Springer,
doi
10.1007/978-3-540-74494-8_48.

** Abstract:** A
nonparametric Bayesian extension of Independent Components Analysis (ICA) is
proposed where observed data Y is modelled as a linear superposition, G, of a
potentially infinite number of hidden sources, X. Whether a given source is
active for a specific data point is specified by an infinite binary matrix,
Z. The resulting sparse representation allows increased data reduction
compared to standard ICA. We define a prior on Z using the Indian Buffet
Process (IBP). We describe four variants of the model, with Gaussian or
Laplacian priors on X and the one or two-parameter IBPs. We demonstrate
Bayesian inference under these models using a Markov chain Monte Carlo (MCMC)
algorithm on synthetic and gene expression data and compare to standard ICA
algorithms.

E. Meeds, Z. Ghahramani, R. Neal, and S.T. Roweis.
**Modelling
dyadic data with binary latent factors**.
In B. Schölkopf, J. Platt, and T. Hofmann, editors, *Advances in Neural
Information Processing Systems 19*, Bradford Books, pages 977-984,
Cambridge, MA, USA, September 2007. The MIT Press.
Online contents gives pages 1002-1009, and 977-984 on pdf contents.

** Abstract:** We introduce binary matrix factorization, a novel
model for unsupervised matrix decomposition. The decomposition is learned by
fitting a non-parametric Bayesian probabilistic model with binary latent
variables to a matrix of dyadic data. Unlike bi-clustering models, which
assign each row or column to a single cluster based on a categorical hidden
feature, our binary feature model reflects the prior belief that items and
attributes can be associated with more than one latent cluster at a time. We
provide simple learning and inference rules for this new model and show how
to extend it to an infinite model in which the number of features is not a
priori fixed but is allowed to grow with the size of the data.

F. Pérez-Cruz, Zoubin Ghahramani, and M. Pontil.
**Conditional
graphical models**.
In G. H. Bakir, T. Hofmann, B. Schölkopf, A. J. Smola, B. Taskar, and
S. V. N. Vishwanathan, editors, *Predicting Structured Data*, pages
265-282. The MIT Press, Cambridge, MA, USA, September 2007.
Chapter 12.

** Abstract:** In this chapter we propose a
modification of CRF-like algorithms that allows for solving large-scale
structured classification problems. Our approach consists in upper bounding
the CRF functional in order to decompose its training into independent
optimisation problems per clique. Furthermore we show that each sub-problem
corresponds to solving a multiclass learning task in each clique, which
enlarges the applicability of these tools for large-scale structural learning
problems. Before presenting the Conditional Graphical Model (CGM), as we
refer to this procedure, we review the family of CRF algorithms. We
concentrate on the best known procedures and standard generalisations of
CRFs. The ob jective of this introduction is analysing from the same
viewpoint the proposed solutions in the literature to tackle this problem,
which allows comparing their different features. We complete the chapter with
a case study, in which we show the possibility to work with large-scale
problems using CGM and that the obtained performance is comparable to the
result with CRF-like algorithms.

Joaquin Quiñonero-Candela, Carl Edward Rasmussen, and Christopher K. I.
Williams.
**Approximation
methods for Gaussian process regression**.
In L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, editors, *Large-Scale
Kernel Machines*, Neural Information Processing, pages 203-223. The
MIT Press, Cambridge, MA, USA, September 2007.

**
Abstract:** A wealth of computationally efficient approximation methods for
Gaussian process regression have been recently proposed. We give a unifying
overview of sparse approximations, following Quiñonero-Candela and Rasmussen (2005), and a
brief review of approximate matrix-vector multiplication methods.

** Comment:** book

Edward Snelson and Zoubin Ghahramani.
**Local and global
sparse Gaussian process approximations**.
In M. Meila and X. Shen, editors, *11th International Conference on
Artificial Intelligence and Statistics*. Omnipress, 2007.

**
Abstract:** Gaussian process (GP) models are flexible probabilistic
nonparametric models for regression, classification and other tasks.
Unfortunately they suffer from computational intractability for large data
sets. Over the past decade there have been many different approximations
developed to reduce this cost. Most of these can be termed global
approximations, in that they try to summarize all the training data via a
small set of support points. A different approach is that of local
regression, where many local experts account for their own part of space. In
this paper we start by investigating the regimes in which these different
approaches work well or fail. We then proceed to develop a new sparse GP
approximation which is a combination of both the global and local approaches.
Theoretically we show that it is derived as a natural extension of the
framework developed by Quiñonero-Candela and
Rasmussen for sparse GP approximations. We demonstrate the benefits of
the combined approximation on some 1D examples for illustration, and on some
large real-world data sets.

Sören Sonnenburg, Mikio L. Braun, Cheng Soon Ong, Samy Bengio, Leon Bottou,
Geoffrey Holmes, Yann LeCun, Klaus-Robert Müller, Fernando Pereira,
Carl Edward Rasmussen, Gunnar Rätsch, Bernhard Schölkopf, Alexander
Smola, Pascal Vincent, Jason Weston, and Robert C. Williamson.
**The need for open source software in machine learning**.
*Journal of Machine Learning Research*, 8:2443-2466, October 2007.

** Abstract:** Open source tools have recently reached a level
of maturity which makes them suitable for building large-scale real-world
systems. At the same time, the field of machine learning has developed a
large body of powerful learning algorithms for diverse applications. However,
the true potential of these methods is not realized, since existing
implementations are not openly shared, resulting in software with low
usability, and weak interoperability. We argue that this situation can be
significantly improved by increasing incentives for researchers to publish
their software under an open source model. Additionally, we outline the
problems authors are faced with when trying to publish algorithmic
implementations of machine learning methods. We believe that a resource of
peer reviewed software accompanied by short articles would be highly valuable
to both the machine learning and the general scientific community.

Richard E. Turner and M Sahani.
**Probabilistic amplitude demodulation**.
In *7th International Conference on Independent Component Analysis and
Signal Separation*, pages 544-551, 2007.

** Abstract:**
Auditory scene analysis is extremely challenging. One approach, perhaps that
adopted by the brain, is to shape useful representations of sounds on prior
knowledge about their statistical structure. For example, sounds with
harmonic sections are common and so time-frequency representations are
efficient. Most current representations concentrate on the shorter
components. Here, we propose representations for structures on longer
time-scales, like the phonemes and sentences of speech. We decompose a sound
into a product of processes, each with its own characteristic time-scale.
This demodulation cascade relates to classical amplitude demodulation, but
traditional algorithms fail to realise the representation fully. A new
approach, probabilistic amplitude demodulation, is shown to out-perform the
established methods, and to easily extend to representation of a full
demodulation cascade.

Richard E. Turner and Maneesh Sahani.
**A maximum-likelihood interpretation for slow feature
analysis**.
*nc*, 19(4):1022-1038, 2007, doi
http://dx.doi.org/10.1162/neco.2007.19.4.1022.

**
Abstract:** The brain extracts useful features from a maelstrom of sensory
information, and a fundamental goal of theoretical neuroscience is to work
out how it does so. One proposed feature extraction strategy is motivated by
the observation that the meaning of sensory data, such as the identity of a
moving visual object, is often more persistent than the activation of any
single sensory receptor. This notion is embodied in the slow feature analysis
(SFA) algorithm, which uses “slowness” as an heuristic by which to
extract semantic information from multi-dimensional time-series. Here, we
develop a probabilistic interpretation of this algorithm showing that
inference and learning in the limiting case of a suitable probabilistic model
yield exactly the results of SFA. Similar equivalences have proved useful in
interpreting and extending comparable algorithms such as independent
component analysis. For SFA, we use the equivalent probabilistic model as a
conceptual spring-board, with which to motivate several novel extensions to
the algorithm.

J. Van Gael and X. Zhu.
**Correlation
clustering for crosslingual link detection**.
In Manuela M. Veloso, editor, *International Joint Conference on Artificial
Intelligence (IJCAI)*, pages 1744-1749, Hyderabad, India, January
2007.

** Abstract:** The crosslingual link detection problem
calls for identifying news articles in multiple languages that report on the
same news event. This paper presents a novel approach based on constrained
clustering. We discuss a general way for constrained clustering using a
recent, graph-based clustering framework called correlation clustering. We
introduce a correlation clustering implementation that features linear
program chunking to allow processing larger datasets. We show how to apply
the correlation clustering algorithm to the crosslingual link detection
problem and present experimental results that show correlation clustering
improves upon the hierarchical clustering approaches commonly used in link
detection, and, hierarchical clustering approaches that take constraints into
account.

Arik Azran and Zoubin Ghahramani.
**A new approach to
data driven clustering**.
In William Cohen and Andrew Moore, editors, *23rd International Conference
on Machine Learning*, pages 57-64, Pittsburgh, PA, USA, June 2006.
Omnipress.

** Abstract:** We consider the problem of
clustering in its most basic form where only a local metric on the data space
is given. No parametric statistical model is assumed, and the number of
clusters is learned from the data. We introduce, analyze and demonstrate a
novel approach to clustering where data points are viewed as nodes of a
graph, and pairwise similarities are used to derive a transition probability
matrix P for a Markov random walk between them. The algorithm automatically
reveals structure at increasing scales by varying the number of steps taken
by this random walk. Points are represented as rows of Pt, which are the
t-step distributions of the walk starting at that point; these distributions
are then clustered using a KL-minimizing iterative algorithm. Both the number
of clusters, and the number of steps that best reveal it, are found by
optimizing spectral properties of P.

Arik Azran and Zoubin Ghahramani.
**Spectral methods
for automatic multiscale data clustering**.
In *IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*,
pages 190-197, New York, NY, USA, June 2006. IEEE Computer Society, doi
10.1109/CVPR.2006.289.

** Abstract:** Spectral clustering
is a simple yet powerful method for finding structure in data using spectral
properties of an associated pairwise similarity matrix. This paper provides
new insights into how the method works and uses these to derive new
algorithms which given the data alone automatically learn different plausible
data partitionings. The main theoretical contribution is a generalization of
a key result in the field, the multicut lemma (Meila 2001). We use this
generalization to derive two algorithms. The first uses the eigenvalues of a
given affinity matrix to infer the number of clusters in data, and the second
combines learning the affinity matrix with inferring the number of clusters.
A hierarchical implementation of the algorithms is also derived. The
algorithms are theoretically motivated and demonstrated on nontrivial data
sets.

Zoubin Ghahramani and Katherine A. Heller.
**Bayesian
sets**.
In Y. Weiss, B. Schölkopf, and J. Platt, editors, *Advances in Neural
Information Processing Systems 18*, pages 435-442, Cambridge, MA, USA,
December 2006. The MIT Press.

** Abstract:** Inspired by
"Google™ Sets", we consider the problem of retrieving items from a
concept or cluster, given a query consisting of a few items from that
cluster. We formulate this as a Bayesian inference problem and describe a
very simple algorithm for solving it. Our algorithm uses a model-based
concept of a cluster and ranks items using a score which evaluates the
marginal probability that each item belongs to a cluster containing the query
items. For exponential family models with conjugate priors this marginal
probability is a simple function of sufficient statistics. We focus on sparse
binary data and show that our score can be evaluated exactly using a single
sparse matrix multiplication, making it possible to apply our algorithm to
very large datasets. We evaluate our algorithm on three datasets: retrieving
movies from EachMovie, finding completions of author sets from the NIPS
dataset, and finding completions of sets of words appearing in the Grolier
encyclopedia. We compare to Google™ Sets and show that Bayesian Sets
gives very reasonable set completions.

Dilan Görür, Frank Jäkel, and Carl Edward Rasmussen.
**A choice model
with infinitely many latent features**.
In W. W. Cohen and Andrew Moore, editors, *23rd International Conference on
Machine Learning*, pages 361-368, New York, NY, USA, June 2006. ACM
Press, doi
10.1145/1143844.1143890.

** Abstract:** Elimination by
aspects (EBA) is a probabilistic choice model describing how humans decide
between several options. The options from which the choice is made are
characterized by binary features and associated weights. For instance, when
choosing which mobile phone to buy the features to consider may be: long
lasting battery, color screen, etc. Existing methods for inferring the
parameters of the model assume pre-specified features. However, the features
that lead to the observed choices are not always known. Here, we present a
non-parametric Bayesian model to infer the features of the options and the
corresponding weights from choice data. We use the Indian buffet process
(IBP) as a prior over the features. Inference using Markov chain Monte Carlo
(MCMC) in conjugate IBP models has been previously described. The main
contribution of this paper is an MCMC algorithm for the EBA model that can
also be used in inference for other non-conjugate IBP models-this may
broaden the use of IBP priors considerably.

A. B. Goldberg, D. Andrzejewski, J. Van Gael, B. Settles, X. Zhu, and
M. Craven.
**Ranking
biomedical passages for relevance and diversity: University of Wisconsin,
Madison at TREC Genomics 2006**.
In *Proceedings of the Fifteenth Text REtrieval Conference (TREC 2006)*,
Gaithersburg, MD, USA, November 2006.

** Abstract:** We report
on the University of Wisconsin, Madison's experience in the TREC Genomics
2006 track, which asks participants to retrieve passages from scientific
articles that satisfy biologists' information needs. An emphasis is placed on
returning relevant passages that discuss different aspects of the topic.
Using an off-the-shelf information retrieval (IR) engine, we focused on query
generation and reranking query results to encourage relevance and diversity.
For query generation, we automatically identify noun phrases from the topic
descriptions, and use online resources to gather synonyms as expansion terms.
Our first submission uses the baseline IR engine results. We rerank the
passages using a naive clustering-based approach in our second run, and we
test GRASSHOPPER, a novel graph-theoretic algorithm based on absorbing random
walks, in our third run. While our aspect-level results appear to compare
favorably with other participants on average, our query generation techniques
failed to produce adequate query results for several topics, causing our
passage and document-level evaluation scores to suffer. Furthermore, we
surprisingly achieved higher aspect-level scores using the initial ranking
than our methods aimed specifically at promoting diversity. While this sounds
discouraging, we have several ideas as to why this happened and hope to
produce new methods that correct these shortcomings.

T. L. Griffiths and Z. Ghahramani.
**Infinite latent
feature models and the Indian Buffet Process**.
In Y. Weiss, B. Schölkopf, and J. Platt, editors, *Advances in Neural
Information Processing Systems 18*, pages 475-482, Cambridge, MA, USA,
December 2006. The MIT Press.

** Abstract:** We define a
probability distribution over equivalence classes of binary matrices with a
finite number of rows and an unbounded number of columns. This distribution
is suitable for use as a prior in probabilistic models that represent objects
using a potentially infinite array of features. We identify a simple
generative process that results in the same distribution over equivalence
classes, which we call the Indian buffet process. We illustrate the use of
this distribution as a prior in an infinite latent feature model, deriving a
Markov chain Monte Carlo algorithm for inference in this model and applying
the algorithm to an image dataset.

Malte Kuß and Carl Edward Rasmussen.
**Assessing
approximations for Gaussian process classification**.
In Y. Weiss, B. Schölkopf, and J. Platt, editors, *Advances in Neural
Information Processing Systems 18*, pages 699-706, Cambridge, MA, USA,
April 2006. The MIT Press.

** Abstract:** Gaussian processes
are attractive models for probabilistic classification but unfortunately
exact inference is analytically intractable. We compare Laplace's method and
Expectation Propagation (EP) focusing on marginal likelihood estimates and
predictive performance. We explain theoretically and corroborate empirically
that EP is superior to Laplace. We also compare to a sophisticated MCMC
scheme and show that EP is surprisingly accurate.

Tobias Pfingsten, Daniel Herrmann, and Carl Edward Rasmussen.
**Model-based design
analysis and yield optimization**.
*IEEE Transactions on Semiconductor Manufacturing*, 19(4):475-486,
2006, doi
10.1109/TSM.2006.883589.

** Abstract:** Fluctuations are
inherent to any fabrication process. Integrated circuits and
microelectromechanical systems are particularly affected by these variations,
and due to high-quality requirements the effect on the devices' perform ance
has to be understood quantitatively. In recent years, it has become possible
to model the performance of such complex systems on the basis of design
specifications, and model-based sensitivity analysis has made its way into
industrial engineering. We show how an efficient Bayesian approach, using a
Gaussian process prior, can replace the commonly used brute-force Monte Carlo
scheme, making it possible to apply the analysis to computationally costly
models. We introduce a number of global, statistically justified sensitivity
measures for design analysis and optimization. Two models of integrated
systems serve us as case studies to introduce the analysis and to assess its
convergence properties. We show that the Bayesian Monte Carlo scheme can save
costly simulation runs and can ensure a reliable accuracy of the
analysis.

** Comment:** Winner of the 2006 Best Paper Award for the
journal.

Joaquin Quiñonero-Candela, Carl Edward Rasmussen, Fabian Sinz, Olivier
Bousquet, and Bernhard Schölkopf.
**Evaluating
predictive uncertainty challenge**.
In J. Quiñonero-Candela, I. Dagan, B. Magnini, and F. d'Alché-Buc,
editors, *Machine Learning Challenges. Evaluating predictive uncertainty,
visual object classification and recognising tectual entailment. First PASCAL
Machine Learning Challenges Workshop*, volume 3944 of *Lecture Notes
in Computer Science (LNCS)*, pages 1-27, Berlin, Germany, 04 2006.
Springer, doi
10.1007/11736790_1.

** Abstract:** This Chapter presents
the PASCAL1 Evaluating Predictive Uncertainty Challenge, introduces the
contributed Chapters by the participants who obtained outstanding results,
and provides a discussion with some lessons to be learnt. The Challenge was
set up to evaluate the ability of Machine Learning algorithms to provide good
"probabilistic predictions", rather than just the usual "point predictions"
with no measure of uncertainty, in regression and classification problems.
Participants had to compete on a number of regression and classification
tasks, and were evaluated by both traditional losses that only take into
account point predictions and losses we proposed that evaluate the quality of
the probabilistic predictions.

Carl Edward Rasmussen and Christopher K. I. Williams.
**Gaussian processes
for machine learning**.
The MIT Press, 2006.

** Abstract:** Gaussian processes (GPs)
provide a principled, practical, probabilistic approach to learning in kernel
machines. GPs have received increased attention in the machine-learning
community over the past decade, and this book provides a long-needed
systematic and unified treatment of theoretical and practical aspects of GPs
in machine learning. The treatment is comprehensive and self-contained,
targeted at researchers and students in machine learning and applied
statistics.

** Comment:** Winner of the 2009 DeGroot Prize. Book
web page, chapters and entire book
pdf. GPML Toolbox.

Edward Snelson and Zoubin Ghahramani.
**Sparse Gaussian
processes using pseudo-inputs**.
In Y. Weiss, B. Schölkopf, and J. Platt, editors, *Advances in Neural
Information Processing Systems 18*, pages 1257-1264. The MIT Press,
Cambridge, MA, 2006.

** Abstract:** We present a new Gaussian
process (GP) regression model whose covariance is parameterized by the the
locations of M pseudo-input points, which we learn by a gradient based
optimization. We take M<<N, where N is the number of real data points,
and hence obtain a sparse regression method which has O(NM^{2})
training cost and O(M^{2}) prediction cost per test case. We also
find hyperparameters of the covariance function in the same joint
optimization. The method can be viewed as a Bayesian regression model with
particular input dependent noise. The method turns out to be closely related
to several other sparse GP approaches, and we discuss the relation in detail.
We finally demonstrate its performance on some large data sets, and make a
direct comparison to other sparse GP methods. We show that our method can
match full GP performance with small M, i.e. very sparse solutions, and it
significantly outperforms other approaches in this regime.

Edward Snelson and Zoubin Ghahramani.
**Variable
noise and dimensionality reduction for sparse Gaussian processes**.
In R. Dechter and T. S. Richardson, editors, *22nd Conference on Uncertainty
in Artificial Intelligence*. AUAI Press, 2006.

**
Abstract:** The sparse pseudo-input Gaussian process (SPGP) is a new
approximation method for speeding up GP regression in the case of a large
number of data points N. The approximation is controlled by the gradient
optimization of a small set of M pseudo-inputs, thereby reducing complexity
from O(N^{3}) to O(NM^{2}). One limitation of the SPGP is
that this optimization space becomes impractically big for high dimensional
data sets. This paper addresses this limitation by performing automatic
dimensionality reduction. A projection of the input space to a low
dimensional space is learned in a supervised manner, alongside the
pseudo-inputs, which now live in this reduced space. The paper also
investigates the suitability of the SPGP for modeling data with
input-dependent noise. A further extension of the model is made to make it
even more powerful in this regard - we learn an uncertainty parameter for
each pseudo-input. The combination of sparsity, reduced dimension, and
input-dependent noise makes it possible to apply GPs to much larger and more
complex data sets than was previously practical. We demonstrate the benefits
of these methods on several synthetic and real world problems.

Malte Kuß, Tobias Pfingsten, Lehel Csatò, and Carl Edward Rasmussen.
**Approximate
inference for robust Gaussian process regression**.
Technical Report 136, Max Planck Institute for Biological Cybernetics,
Tübingen, Germany, 2005.

** Abstract:** Gaussian process
(GP) priors have been successfully used in non-parametric Bayesian regression
and classification models. Inference can be performed analytically only for
the regression model with Gaussian noise. For all other likelihood models
inference is intractable and various approximation techniques have been
proposed. In recent years expectation-propagation (EP) has been developed as
a general method for approximate inference. This article provides a general
summary of how expectation-propagation can be used for approximate inference
in Gaussian process models. Furthermore we present a case study describing
its implementation for a new robust variant of Gaussian process regression.
To gain further insights into the quality of the EP approximation we present
experiments in which we compare to results obtained by Markov chain Monte
Carlo (MCMC) sampling.

Malte Kuß and Carl Edward Rasmussen.
**Assessing
approximate inference for binary Gaussian process classification**.
*Journal of Machine Learning Research*, 6:1679-1704, 2005.

** Abstract:** Gaussian process priors can be used to define
flexible, probabilistic classification models. Unfortunately exact Bayesian
inference is analytically intractable and various approximation techniques
have been proposed. In this work we review and compare Laplace's method and
Expectation Propagation for approximate Bayesian inference in the binary
Gaussian process classification model. We present a comprehensive comparison
of the approximations, their predictive performance and marginal likelihood
estimates to results obtained by MCMC sampling. We explain theoretically and
corroborate empirically the advantages of Expectation Propagation compared to
Laplace's method.

Joaquin Quiñonero-Candela and Carl Edward Rasmussen.
**A unifying view of sparse approximate Gaussian process
regression**.
*Journal of Machine Learning Research*, 6:1939-1959, 2005.

** Abstract:** We provide a new unifying view, including all
existing proper probabilistic sparse approximations for Gaussian process
regression. Our approach relies on expressing the effective prior which the
methods are using. This allows new insights to be gained, and highlights the
relationship between existing methods. It also allows for a clear
theoretically justified ranking of the closeness of the known approximations
to the corresponding full GPs. Finally we point directly to designs of new
better sparse approximations, combining the best of the existing strategies,
within attractive computational constraints.

Joaquin Quiñonero-Candela and Carl Edward Rasmussen.
**Analysis of some
methods for reduced rank Gaussian process regression**.
In R. Murray-Smith and R. Shorten, editors, *Switching and Learning in
Feedback Systems*, pages 98-127. Springer, Berlin, Heidelberg, 2005.

** Abstract:** While there is strong motivation for using
Gaussian Processes (GPs) due to their excellent performance in regression and
classification problems, their computational complexity makes them
impractical when the size of the training set exceeds a few thousand cases.
This has motivated the recent proliferation of a number of cost-effective
approximations to GPs, both for classification and for regression. In this
paper we analyze one popular approximation to GPs for regression: the reduced
rank approximation. While generally GPs are equivalent to infinite linear
models, we show that Reduced Rank Gaussian Processes (RRGPs) are equivalent
to finite sparse linear models. We also introduce the concept of degenerate
GPs and show that they correspond to inappropriate priors. We show how to
modify the RRGP to prevent it from being degenerate at test time. Training
RRGPs consists both in learning the covariance function hyperparameters and
the support set. We propose a method for learning hyperparameters for a given
support set. We also review the Sparse Greedy GP (SGGP) approximation (Somla
and Bartlett, 2001), which is a way of learning the support set for given
hyperparameters based on approximating the posterior. We propose an
alternative method to the SGGP that has better generalization capabilities.
Finally we make experiments to compare the different ways of training a RRGP.
We provide some Matlab code for learning RRGPs.

Carl Edward Rasmussen and Joaquin Quiñonero-Candela.
**Healing the
Relevance Vector Machine through augmentation**.
In L. De Raedt and S. Wrobel, editors, *22nd International Conference on
Machine Learning*, pages 689-696, 2005.

** Abstract:**
The Relevance Vector Machine (RVM) is a sparse approximate Bayesian kernel
method. It provides full predictive distributions for test cases. However,
the predictive uncertainties have the unintuitive property, that *they
get smaller the further you move away from the training cases*. We give a
thorough analysis. Inspired by the analogy to non-degenerate Gaussian
Processes, we suggest augmentation to solve the problem. The purpose of the
resulting model, RVM*, is primarily to corroborate the theoretical and
experimental analysis. Although RVM* could be used in practical applications,
it is no longer a truly sparse model. Experiments show that sparsity comes at
the expense of worse predictive distributions.

Edward Snelson and Zoubin Ghahramani.
**Compact
approximations to Bayesian predictive distributions**.
In *22nd International Conference on Machine Learning*, Bonn, Germany,
August 2005. Omnipress.

** Abstract:** We provide a general
framework for learning precise, compact, and fast representations of the
Bayesian predictive distribution for a model. This framework is based on
minimizing the KL divergence between the true predictive density and a
suitable compact approximation. We consider various methods for doing this,
both sampling based approximations, and deterministic approximations such as
expectation propagation. These methods are tested on a mixture of Gaussians
model for density estimation and on binary linear classification, with both
synthetic data sets for visualization and several real data sets. Our results
show significant reductions in prediction time and memory footprint.

A. Dubey, S. Hwang, C. Rangel, Carl Edward Rasmussen, Zoubin Ghahramani, and
David L. Wild.
**Clustering
protein sequence and structure space with infinite Gaussian mixture
models**.
In *Pacific Symposium on Biocomputing 2004*, pages 399-410, Singapore,
2004. World Scientific Publishing.

** Abstract:** We describe
a novel approach to the problem of automatically clustering protein sequences
and discovering protein families, subfamilies etc., based on the thoery of
infinite Gaussian mixture models. This method allows the data itself to
dictate how many mixture components are required to model it, and provides a
measure of the probability that two proteins belong to the same cluster. We
illustrate our methods with application to three data sets: globin sequences,
globin sequences with known tree-dimensional structures and G-pretein coupled
receptor sequences. The consistency of the clusters indicate that that our
methods is producing biologically meaningful results, which provide a very
good indication of the underlying families and subfamilies. With the
inclusion of secondary structure and residue solvent accessibility
information, we obtain a classification of sequences of known structure which
reflects and extends their SCOP classifications.

Jan Eichhorn, Andreas S. Tolias, Alexander Zien, Malte Kuß, Carl Edward
Rasmussen, Jason Weston, Nikos K. Logothetis, and Bernhard Schölkopf.
**Prediction on
spike data using kernel algorithms**.
In Sebastian Thrun, Lawrence K. Saul, and Bernhard Schölkopf, editors,
*Advances in Neural Information Processing Systems 16*,
volume 16, pages 1367-1374, Cambridge, MA, USA, 2004. The MIT
Press.

** Abstract:** We report and compare the performance of
different learning algorithms based on data from cortical recordings. The
task is to predict the orientation of visual stimuli from the activity of a
population of simultaneously recorded neurons. We compare several ways of
improving the coding of the input (i.e., the spike data) as well as of the
output (i.e., the orientation), and report the results obtained using
different kernel algorithms.

Matthias O. Franz, Younghee Kwon, Carl Edward Rasmussen, and Bernhard
Schölkopf.
**Semi-supervised kernel regression using whitened function classes**.
In C. E. Rasmussen, H. H. Bülthoff, M. A. Giese, and B. Schölkopf,
editors, *Lecture Notes in Computer Science (LNCS)*, volume 3175,
pages 18-26, Berlin, Germany, 2004. Springer.

** Abstract:**
The use of non-orthonormal basis functions in ridge regression leads to an
often undesired non-isotropic prior in function space. In this study, we
investigate an alternative regularization technique that results in an
implicit whitening of the basis functions by penalizing directions in
function space with a large prior variance. The regularization term is
computed from unlabelled input data that characterizes the input
distribution. Tests on two datasets using polynomial basis functions showed
an improved average performance compared to standard ridge regression.

Dilan Görür, Carl Edward Rasmussen, Andreas S. Tolias, Fabian Sinz, and
Nikos K. Logothetis.
**Modelling
spikes with mixtures of factor analysers**.
In C. E. Rasmussen, H. H. Bülthoff, B. Schölkopf, and M. A. Giese,
editors, *DAGM 2004*, volume 3175 of *Lecture Notes in Computer
Science (LNCS)*, pages 391-398, Berlin, Germany, 09 2004. Springer.

** Abstract:** Identifying the action potentials of individual
neurons from extracellular recordings, known as spike sorting, is a
challenging problem. We consider the spike sorting problem using a generative
model,mixtures of factor analysers, which concurrently performs clustering
and feature extraction. The most important advantage of this method is that
it quantifies the certainty with which the spikes are classified. This can be
used as a means for evaluating the quality of clustering and therefore spike
isolation. Using this method, nearly simultaneously occurring spikes can also
be modelled which is a hard task for many of the spike sorting methods.
Furthermore, modelling the data with a generative model allows us to generate
simulated data.

Juš Kocijan, Roderick Murray-Smith, Carl Edward Rasmussen, and Agathe
Girard.
**Gaussian
process model based predictive control**.
In *American Control Conference*, pages 2214-2219, 2004.

** Abstract:** Gaussian process models provide a probabilistic
non-parametric modelling approach for black-box identi cation of non-linear
dynamic systems. The Gaussian processes can highlight areas of the input
space where prediction quality is poor, due to the lack of data or its
complexity, by indicating the higher variance around the predicted mean.
Gaussian process models contain noticeably less coef cients to be optimised.
This paper illustrates possible application of Gaussian process models within
model-based predictive control. The extra information provided within
Gaussian process model is used in predictive control, where optimisation of
control signal takes the variance information into account. The predictive
control principle is demonstrated on control of pH process benchmark.

Carl Edward Rasmussen.
**Gaussian processes in
machine learning**.
In Olivier Bousquet, Ulrike von Luxburg, and Gunnar Rätsch, editors,
*Advanced Lectures on Machine Learning: ML Summer Schools 2003, Canberra,
Australia, February 2 - 14, 2003, Tübingen, Germany, August 4 - 16, 2003,
Revised Lectures*, volume 3176 of *Lecture Notes in Computer Science
(LNCS)*, pages 63-71. Springer-Verlag, Heidelberg, 2004.

**
Abstract:** We give a basic introduction to Gaussian Process regression
models. We focus on understanding the role of the stochastic process and how
it is used to define a distribution over functions. We present the simple
equations for incorporating training data and examine how to learn the
hyperparameters using the marginal likelihood. We explain the practical
advantages of Gaussian Process and end with conclusions and a look at the
current trends in GP work.

** Comment:** Copyright by Springer, springerlink

Carl Edward Rasmussen, Heinrich H. Bülthoff, Martin A. Giese, and Bernhard
Schölkopf, editors.
**Pattern Recognition: 26th DAGM
Symposium**, volume 3175 of *Lecture Notes in Computer Science
(LNCS)*, Berlin, Germany, August 2004. Springer, doi 10.1007/b99676.

Carl Edward Rasmussen and Malte Kuß.
**Gaussian processes
in reinforcement learning**.
In S. Thrun, L.K. Saul, and B. Schölkopf, editors, *Advances in Neural
Information Processing Systems 16*, pages 751-759, Cambridge, MA, USA,
December 2004. The MIT Press.

** Abstract:** We exploit some
useful properties of Gaussian process (GP) regression models for
reinforcement learning in continuous state spaces and discrete time. We
demonstrate how the GP model allows evaluation of the value function in
closed form. The resulting policy iteration algorithm is demonstrated on a
simple problem with a two dimensional state space. Further, we speculate that
the intrinsic ability of GP models to characterise distributions of functions
would allow the method to capture entire distributions over future values
instead of merely their expectation, which has traditionally been the focus
of much of reinforcement learning.

Fabian Sinz, Joaquin Quiñonero-Candela, Gökhan H. Bakir, Carl Edward
Rasmussen, and Matthias O. Franz.
**Learning
depth from stereo**.
In C. E. Rasmussen, H. H. Bülthoff, B. Schölkopf, and M. A. Giese,
editors, *26th DAGM Symposium*, volume 3175 of *Lecture Notes in
Computer Science (LNCS)*, pages 245-252, Berlin, Germany, 09 2004.
Springer.

** Abstract:** We compare two approaches to the
problem of estimating the depth of a point in space from observing its image
position in two different cameras: 1. The classical photogrammetric approach
explicitly models the two cameras and estimates their intrinsic and extrinsic
parameters using a tedious calibration procedure; 2. A generic machine
learning approach where the mapping from image to spatial coordinates is
directly approximated by a Gaussian Process regression. Our results show that
the generic learning approach, in addition to simplifying the procedure of
calibration, can lead to higher depth accuracies than classical calibration
although no specific domain knowledge is used.

Edward Snelson, Carl Edward Rasmussen, and Zoubin Ghahramani.
**Warped Gaussian
processes**.
In S. Thrun, L. Saul, and B. Schölkopf, editors, *Advances in Neural
Information Processing Systems 16*, pages 337-344, Cambridge, MA, USA,
December 2004. The MIT Press.

** Abstract:** We generalise
the Gaussian process (GP) framework for regression by learning a nonlinear
transformation of the GP outputs. This allows for non-Gaussian processes and
non-Gaussian noise. The learning algorithm chooses a nonlinear transformation
such that transformed data is well-modelled by a GP. This can be seen as
including a preprocessing transformation as an integral part of the
probabilistic modelling problem, rather than as an ad-hoc step. We
demonstrate on several real regression problems that learning the
transformation can lead to significantly better performance than using a
regular GP, or a GP with a fixed transformation.

Agathe Girard, Carl Edward Rasmussen, Joaquin Quiñonero-Candela, and
Roderick Murray-Smith.
**Gaussian
process priors with uncertain inputs - application to multiple-step ahead
time series forecasting**.
In S. Becker, S. Thrun, and K. Obermayer, editors, *Advances in Neural
Information Processing Systems 15*, pages 529-536, Cambridge, MA, USA,
December 2003. The MIT Press.

** Abstract:** We consider the
problem of multi-step ahead prediction in time series analysis using the
non-parametric Gaussian process model. k-step ahead forecasting of a
discrete-time non-linear dynamic system can be performed by doing repeated
one-step ahead predictions. For a state-space model of the form y_{t}
= f(y_{t-1},...,y_{t-L}), the prediction of y at time t + k
is based on the point estimates of the previous outputs. In this paper, we
show how, using an analytical Gaussian approximation, we can formally
incorporate the uncertainty about intermediate regressor values, thus
updating the uncertainty on the current prediction.

Juš Kocijan, Blaž Banko, Bojan Likar, Agathe Girard, Roderick
Murray-Smith, and Carl Edward Rasmussen.
**A case based
comparison of identification with neural network and Gaussian process
models**.
In *IFAC Internaltional Conference on Intelligent Control Systems and Signal
Processing*, volume 1, pages 137-142, 2003.

**
Abstract:** In this paper an alternative approach to black-box
identification of non-linear dynamic systems is compared with the more
established approach of using artificial neural networks. The Gaussian
process prior approach is a representative of non-parametric modelling
approaches. It was compared on a pH process modelling case study. The purpose
of modelling was to use the model for control design. The comparison revealed
that even though Gaussian process models can be effectively used for
modelling dynamic systems caution has to be axercised when signals are
selected.

Juš Kocijan, Roderick Murray-Smith, Carl Edward Rasmussen, and Bojan
Likar.
**Predictive
control with Gaussian process models**.
In B. Zajc and M. Tkal, editors, *IEEE Region 8 Eurocon 2003: Computer as a
Tool*, pages 352-356, 2003.

** Abstract:** This paper
describes model-based predictive control based on Gaussian processes.
Gaussian process models provide a probabilistic non-parametric modelling
approach for black-box identification of non-linear dynamic systems. It
offers more insight in variance of obtained model response, as well as fewer
parameters to determine than other models. The Gaussian processes can
highlight areas of the input space where prediction quality is poor, due to
the lack of data or its complexity, by indicating the higher variance around
the predicted mean. This property is used in predictive control, where
optimisation of control signal takes the variance information into account.
The predictive control principle is demonstrated on a simulated example of
nonlinear system.

Roderick Murray-Smith, Daniel Sbarbaro, Carl Edward Rasmussen, and Agathe
Girard.
**Adaptive,
cautious, predictive control with Gaussian process priors**.
In P. Van den Hof, B. Wahlberg, and S. Weiland, editors, *IFAC SYSID
2003*, pages 1195-1200, Oxford, UK, August 2003. Elsevier Science Ltd.

** Abstract:** Nonparametric Gaussian Process models, a Bayesian
statistics approach, are used to implement a nonlinear adaptive control law.
Predictions, including propagation of the state uncertainty are made over a
k-step horizon. The expected value of a quadratic cost function is minimised,
over this prediction horizon, without ignoring the variance of the model
predictions. The general method and its main features are illustrated on a
simulation example.

Joaquin Quiñonero-Candela, Agathe Girard, Jan Larsen, and Carl Edward
Rasmussen.
**Propagation of
uncertainty in Bayesian kernel models - application to multiple-step ahead
forecasting**.
In *ICASSP 2003*, volume 2, pages 701-704, April 2003.

** Abstract:** The object of Bayesian modelling is the
predictive distribution, which in a forecasting scenario enables improved
estimates of forecasted values and their uncertainties. In this paper we
focus on reliably estimating the predictive mean and variance of forecasted
values using Bayesian kernel based models such as the Gaussian Process and
the Relevance Vector Machine. We derive novel analytic expressions for the
predictive mean and variance for Gaussian kernel shapes under the assumption
of a Gaussian input distribution in the static case, and of a recursive
Gaussian predictive density in iterative forecasting. The capability of the
method is demonstrated for forecasting of time-series and compared to
approximate methods.

Joaquin Quiñonero-Candela, Agathe Girard, Jan Larsen, and Carl Edward
Rasmussen.
**Propagation of uncertainty in Bayesian
kernel models - application to multiple-step ahead forecasting**.
In C. Molina, T. Adali, J. Larsen, M. Van Hulle, S. C. Douglas, and J. Rouat,
editors, *NNSP 2003*, Piscataway, New Jersey, 2003. IEEE Press.

** Abstract:** The object of Bayesian modelling is the
predictive distribution, which in a forecasting scenario enables improved
estimates of forecasted values and their uncertainties. In this paper we
focus on reliably estimating the predictive mean and variance of forecasted
values using Bayesian kernel based models such as the Gaussian Process and
the Relevance Vector Machine. We derive novel analytic expressions for the
predictive mean and variance for Gaussian kernel shapes under the assumption
of a Gaussian input distribution in the static case, and of a recursive
Gaussian predictive density in iterative forecasting. The capability of the
method is demonstrated for forecasting of time-series and compared to
approximate methods.

** Comment:** Electronic version of Quiñonero-Candela, Girard, Larsen and
Rasmussen, 2003 which should have been presented at ICASSP 03, but was
cancelled due to bird flu epidemic.

Joaquin Quiñonero-Candela, Agathe Girard, and Carl Edward Rasmussen.
**Prediction at an
uncertain input for Gaussian processes and Relevance Vector Machines
application to multiple-step ahead time-series prediction**.
Technical Report IMM-2003-18, Instititute for Mathemetical Modelling, DTU,
2003.

** Comment:** techreport

Carl Edward Rasmussen.
**Gaussian processes to
speed up Hybrid Monte Carlo for expensive Bayesian integrals**.
In *Bayesian Statistics 7*, pages 651-659. Oxford University Press,
2003.

** Abstract:** Hybrid Monte Carlo (HMC) is often the
method of choice for computing Bayesian integrals that are not analytically
tractable. However the success of this method may require a very large number
of evaluations of the (un-normalized) posterior and its partial derivatives.
In situations where the posterior is computationally costly to evaluate, this
may lead to an unacceptable computational load for HMC. I propose to use a
Gaussian Process model of the (log of the) posterior for most of the
computations required by HMC. Within this scheme only occasional evaluation
of the actual posterior is required to guarantee that the samples generated
have exactly the desired distribution, even if the GP model is somewhat
inaccurate. The method is demonstrated on a 10 dimensional problem, where 200
evaluations suffice for the generation of 100 roughly independent points from
the posterior. Thus, the proposed scheme allows Bayesian treatment of models
with posteriors that are computationally demanding, such as models involving
computer simulation.

Carl Edward Rasmussen and Zoubin Ghahramani.
**Bayesian Monte
Carlo**.
In S. Becker, S. Thrun, and K. Obermayer, editors, *Advances in Neural
Information Processing Systems 15*, pages 489-496, Cambridge, MA, USA,
December 2003. The MIT Press.

** Abstract:** We investigate
Bayesian alternatives to classical Monte Carlo methods for evaluating
integrals. Bayesian Monte Carlo (BMC) allows the incorporation of prior
knowledge, such as smoothness of the integrand, into the estimation. In a
simple problem we show that this outperforms any classical importance
sampling method. We also attempt more challenging multidimensional integrals
involved in computing marginal likelihoods of statistical models (a.k.a.
partition functions and model evidences). We find that Bayesian Monte Carlo
outperformed Annealed Importance Sampling, although for very high dimensional
problems or problems with massive multimodality BMC may be less adequate. One
advantage of the Bayesian approach to Monte Carlo is that samples can be
drawn from any distribution. This allows for the possibility of active design
of sample points so as to maximise information gain.

Ercan Solak, Roderick Murray-Smith, William E. Leithead, Douglas Leith, and
Carl Edward Rasmussen.
**Derivative
observations in Gaussian process models of dynamic systems**.
In S. Becker, S. Thrun, and K. Obermayer, editors, *Advances in Neural
Information Processing Systems 15*, pages 1033-1040, Cambridge, MA, USA,
December 2003. The MIT Press.

** Abstract:** Gaussian
processes provide an approach to nonparametric modelling which allows a
straightforward combination of function and derivative observations in an
empirical model. This is of particular importance in identification of
nonlinear dynamic systems from experimental data. 1) It allows us to combine
derivative information, and associated uncertainty with normal function
observations into the learning and inference process. This derivative
information can be in the form of priors specified by an expert or identified
from perturbation data close to equilibrium. 2) It allows a seamless fusion
of multiple local linear models in a consistent manner, inferring consistent
models and ensuring that integrability constraints are met. 3) It improves
dramatically the computational efficiency of Gaussian process models for
dynamic system identification, by summarising large quantities of
near-equilibrium data by a handful of linearisations, reducing the training
set size - traditionally a problem for Gaussian process models.

Irene K. Andersen, Anna Szymkowiak, Carl Edward Rasmussen, L. G. Hanson, J. R.
Marstrand, H. B. W. Larsson, and Lars Kai Hansen.
**Perfusion
quantification using Gaussian process deconvolution**.
*Magnetic Resonance in Medicine*, 48(2):351-361, 2002, doi 10.1002/mrm.10213.

** Abstract:** The quantification of perfusion using dynamic
susceptibility contrast MR imaging requires deconvolution to obtain the
residual impulse-response function (IRF). Here, a method using a Gaussian
process for deconvolution, GPD, is proposed. The fact that the IRF is smooth
is incorporated as a constraint in the method. The GPD method, which
automatically estimates the noise level in each voxel, has the advantage that
model parameters are optimized automatically. The GPD is compared to singular
value decomposition (SVD) using a common threshold for the singular values
and to SVD using a threshold optimized according to the noise level in each
voxel. The comparison is carried out using artificial data as well as using
data from healthy volunteers. It is shown that GPD is comparable to SVD
variable optimized threshold when determining the maximum of the IRF, which
is directly related to the perfusion. GPD provides a better estimate of the
entire IRF. As the signal to noise ratio increases or the time resolution of
the measurements increases, GPD is shown to be superior to SVD. This is also
found for large distribution volumes.

Matthew J. Beal, Zoubin Ghahramani, and Carl Edward Rasmussen.
**The infinite
hidden Markov model**.
In *Advances in Neural Information Processing Systems 14*, pages
577-584, Cambridge, MA, USA, December 2002. The MIT Press.

**
Abstract:** We show that it is possible to extend hidden Markov models to
have a countably infinite number of hidden states. By using the theory of
Dirichlet processes we can implicitly integrate out the infinitely many
transition parameters, leaving only three hyperparameters which can be
learned from data. These three hyperparameters define a hierarchical
Dirichlet process capable of capturing a rich set of transition dynamics. The
three hyperparameters control the time scale of the dynamics, the sparsity of
the underlying state-transition matrix, and the expected number of distinct
hidden states in a finite sequence. In this framework it is also natural to
allow the alphabet of emitted symbols to be infinite - consider, for
example, symbols being possible words appearing in English text.

Carl Edward Rasmussen and Zoubin Ghahramani.
**Infinite mixtures of
Gaussian process experts**.
In *Advances in Neural Information Processing Systems 14*, pages
881-888, Cambridge, MA, USA, December 2002. The MIT Press.

**
Abstract:** We present an extension to the Mixture of Experts (ME) model,
where the individual experts are Gaussian Process (GP) regression models.
Using an input-dependent adaptation of the Dirichlet Process, we implement a
gating network for an infinite number of Experts. Inference in this model may
be done efficiently using a Markov Chain relying on Gibbs sampling. The model
allows the effective covariance function to vary with the inputs, and may
handle large datasets - thus potentially overcoming two of the biggest
hurdles with GP models. Simulations show the viability of this approach.

Christopher K. I. Williams, Carl Edward Rasmussen, Anton Schwaighofer, and
Volker Tresp.
**Observations
on the Nyström method for Gaussian process prediction**.
Technical report, University of Edinburgh, 2002.

** Abstract:**
A number of methods for speeding up Gaussian Process (GP) prediction have
been proposed, including the Nyström method of Williams and Seeger
(2001). In this paper we focus on two issues (1) the relationship of the
Nyström method to the Subset of Regressors method (Poggio and Girosi
1990; Luo and Wahba, 1997) and (2) understanding in what circumstances the
Nyström approximation would be expected to provide a good approximation
to exact GP regression.

Carl Edward Rasmussen and Zoubin Ghahramani.
**Occam's
razor**.
In *Advances in Neural Information Processing Systems 13*, pages
294-300, Cambridge, MA, USA, December 2001. The MIT Press.

**
Abstract:** The Bayesian paradigm apparently only sometimes gives rise to
Occam's Razor; at other times very large models perform well. We give simple
examples of both kinds of behaviour. The two views are reconciled when
measuring complexity of functions, rather than of the machinery used to
implement them. We analyze the complexity of functions for some linear in the
parameter models that are equivalent to Gaussian Processes, and always find
Occam's Razor at work.

Lars Kai Hansen and Carl Edward Rasmussen.
**Pruning from
adaptive regularization**.
*Neural Computation*, 6(6):1222-1231, 1994.

**
Abstract:** Inspired by the recent upsurge of interest in Bayesian methods
we consider adaptive regularization. A generalization based scheme for
adaptation of regularization parameters is introduced and compared to
Bayesian regularization. We show that pruning arises naturally within both
adaptive regularization schemes. As model example we have chosen the simplest
possible: estimating the mean of a random variable with known variance.
Marked similarities are found between the two methods in that they both
involve a "noise limit", below which they regularize with infinite weight
decay, i.e., they prune. However, pruning is not always beneficial. We show
explicitly that both methods in some cases may increase the generalization
error. This corresponds to situations where the underlying assumptions of the
regularizer are poorly matched to the environment.

Pedro A. d. F. R. Højen-Sørensen, Carl Edward Rasmussen, and Lars Kai
Hansen.
**Bayesian
modelling of fMRI time series**.
In *Advances in Neural Information Processing Systems 12*, pages
754-760. The MIT Press, 2000.

** Abstract:** We present a
Hidden Markov Model (HMM) for inferring the hidden psychological state (or
neural activity) during single trial fMRI activation experiments with blocked
task paradigms. Inference is based on Bayesian methodology, using a
combination of analytical and a variety of Markov Chain Monte Carlo (MCMC)
sampling techniques. The advantage of this method is that detection of short
time learning effects between repeated trials is possible since inference is
based only on single trial experiments.

Carl Edward Rasmussen.
**The infinite Gaussian
mixture model**.
In *Advances in Neural Information Processing Systems 12*, pages
554-560. The MIT Press, 2000.

** Abstract:** In a Bayesian
mixture model it is not necessary a priori to limit the number of components
to be finite. In this paper an infinite Gaussian mixture model is presented
which neatly sidesteps the difficult problem of finding the "right" number of
mixture components. Inference in the model is done using an efficient
parameter-free Markov Chain that relies entirely on Gibbs sampling.

Carl Edward Rasmussen.
**A practical Monte
Carlo implementation of Bayesian learning**.
In *Advances in Neural Information Processing Systems 8*, pages
598-604, Cambridge, MA., USA, 1996. The MIT Press.

**
Abstract:** A practical method for Bayesian training of feed-forward neural
networks using sophisticated Monte Carlo methods is presented and evaluated.
In reasonably small amounts of computer time this approach outperforms other
state-of-the-art methods on 5 datalimited tasks from real world domains.

Carl Edward Rasmussen.
**Evaluation of
Gaussian processes and other methods for non-linear regression**.
PhD thesis, University of Toronto, Department of Computer Science, Toronto,
CANADA, 1996.

** Abstract:** This thesis develops two Bayesian
learning methods relying on Gaussian processes and a rigorous statistical
approach for evaluating such methods. In these experimental designs the
sources of uncertainty in the estimated generalisation performances due to
both variation in training and test sets are accounted for. The framework
allows for estimation of generalisation performance as well as statistical
tests of significance for pairwise comparisons. Two experimental designs are
recommended and supported by the DELVE software environment.

Two new
non-parametric Bayesian learning methods relying on Gaussian process priors
over functions are developed. These priors are controlled by hyperparameters
which set the characteristic length scale for each input dimension. In the
simplest method, these parameters are fit from the data using optimization.
In the second, fully Bayesian method, a Markov chain Monte Carlo technique is
used to integrate over the hyperparameters. One advantage of these Gaussian
process methods is that the priors and hyperparameters of the trained models
are easy to interpret.

The Gaussian process methods are benchmarked
against several other methods, on regression tasks using both real data and
data generated from realistic simulations. The experiments show that small
datasets are unsuitable for benchmarking purposes because the uncertainties
in performance measurements are large. A second set of experiments provide
strong evidence that the bagging procedure is advantageous for the
Multivariate Adaptive Regression Splines (MARS) method.

The simulated
datasets have controlled characteristics which make them useful for
understanding the relationship between properties of the dataset and the
performance of different methods. The dependency of the performance on
available computation time is also investigated. It is shown that a Bayesian
approach to learning in multi-layer perceptron neural networks achieves
better performance than the commonly used early stopping procedure, even for
reasonably short amounts of computation time. The Gaussian process methods
are shown to consistently outperform the more conventional methods.

Carl Edward Rasmussen, Radford M. Neal, Geoffrey E. Hinton, Drew van Camp, Mike
Revow, Zoubin Ghahramani, Rafal Kustra, and Robert Tibshirani.
**The DELVE
manual**, 1996.

** Abstract:** DELVE - Data for
Evaluating Learning in Valid Experiments - is a collection of datasets from
many sources, an environment within which this data can be used to assess the
performance of methods for learning relationships from data, and a repository
for the results of such experiments.

** Comment:** The delve website.

Carl Edward Rasmussen and David J. Willshaw.
**Presynaptic and postsynaptic
comptetition in models for the development of neuromuscular
connections**.
*Biological Cybernetics*, 68(5):409-419, 1993, doi 10.1007/BF00198773.

** Abstract:** In the establishment of connections between nerve
and muscle there is an initial stage when each muscle fibre is innervated by
several different motor axons. Withdrawal of connections then takes place
until each fibre has contact from just a single axon. The evidence suggests
that the withdrawal process involves competition between nerve terminals. We
examine in formal models several types of competitive mechanism that have
been proposed for this phenomenon. We show that a model which combines
competition for a presynaptic resource with competition for a postsynaptic
resource is superior to others. This model accounts for many anatomical and
physiological findings and has a biologically plausible implementation.
Intrinsic withdrawal appears to be a side effect of the competitive mechanism
rather than a separate non-competitive feature. The model's capabilities are
confirmed by theoretical analysis and full scale computer simulations.

Chris K. I. Williams and Carl Edward Rasmussen.
**Gaussian processes
for regression**.
In *Advances in Neural Information Processing Systems 8*, pages
514-520, Cambridge, MA., USA, 1996. The MIT Press.

**
Abstract:** The Bayesian analysis of neural networks is difficult because a
simple prior over weights implies a complex prior over functions. We
investigate the use of a Gaussian process prior over functions, which permits
the predictive Bayesian analysis for fixed values of hyperparameters to be
carried out exactly using matrix operations. Two methods, using optimization
and averaging (via Hybrid Monte Carlo) over hyperparameters have been tested
on a number of challenging problems and have produced excellent results.