Publications, Machine Learning Group, Department of Engineering, Cambridge

[ GPs | Clustering | Graphical Models | MCMC | Semi-Supervised | Non-Parametric | Approximations | Bioinformatics | Information Retreival | RL and Control | Time Series | Network Modelling | Active Learning | Neuroscience | Signal Processing | Machine Vision | Machine Hearing | NLP | Deep Learning | Review ]
[ Bratières | Chen | Cunningham | Davies | Duvenaud | Eaton | Frellsen | Frigola | Van Gael | Gal | Ghahramani | Heaukulani | Heller | Hernández-Lobato | Hoffman | Houlsby | Huszár | Knowles | Lacoste-Julien | Lloyd | Lopez-Paz | McHutchon | Mohamed | Orbanz | Ortega | Palla | Quadrianto | Rasmussen | Roy | Saatçi | Steinrücken | Rich Turner | Ryan Turner | Snelson | Williamson | Wilson ]
[ 2014 | 2013 | 2012 | 2011 | 2010 | 2009 | 2008 | 2007 | 2006 | 2005 | 2004 | 2003 | 2002 | 2001 | past millennia ]

Sébastien Bratières

Sébastien Bratières, Novi Quadrianto, Sebastian Nowozin, and Zoubin Ghahramani. Scalable Gaussian Process structured prediction for grid factor graph applications. In 31st International Conference on Machine Learning, 2014.

Abstract: Structured prediction is an important and well studied problem with many applications across machine learning. GPstruct is a recently proposed structured prediction model that offers appealing properties such as being kernelised, non-parametric, and supporting Bayesian inference (Bratières et al. 2013). The model places a Gaussian process prior over energy functions which describe relationships between input variables and structured output variables. However, the memory demand of GPstruct is quadratic in the number of latent variables and training runtime scales cubically. This prevents GPstruct from being applied to problems involving grid factor graphs, which are prevalent in computer vision and spatial statistics applications. Here we explore a scalable approach to learning GPstruct models based on ensemble learning, with weak learners (predictors) trained on subsets of the latent variables and bootstrap data, which can easily be distributed. We show experiments with 4M latent variables on image segmentation. Our method outperforms widely-used conditional random field models trained with pseudo-likelihood. Moreover, in image segmentation problems it improves over recent state-of-the-art marginal optimisation methods in terms of predictive performance and uncertainty calibration. Finally, it generalises well on all training set sizes.

Sébastien Bratières, Novi Quadrianto, and Zoubin Ghahramani. Bayesian structured prediction using Gaussian processes. arXiv, abs/1307.3846, 2013.

Abstract: We introduce a conceptually novel structured prediction model, GPstruct, which is kernelized, non-parametric and Bayesian, by design. We motivate the model with respect to existing approaches, among others, conditional random fields (CRFs), maximum margin Markov networks (M3N), and structured support vector machines (SVMstruct), which embody only a subset of its properties. We present an inference procedure based on Markov Chain Monte Carlo. The framework can be instantiated for a wide range of structured objects such as linear chains, trees, grids, and other general graphs. As a proof of concept, the model is benchmarked on several natural language processing tasks and a video gesture segmentation task involving a linear chain structure. We show prediction accuracies for GPstruct which are comparable to or exceeding those of CRFs and SVMstruct.

Sébastien Bratières, Jurgen Van Gael, Andreas Vlachos, and Zoubin Ghahramani. Scaling the iHMM: Parallelization versus Hadoop. In Proceedings of the 2010 10th IEEE International Conference on Computer and Information Technology, pages 1235-1240, Bradford, UK, 2010. IEEE Computer Society, doi 10.1109/CIT.2010.223.

Abstract: This paper compares parallel and distributed implementations of an iterative, Gibbs sampling, machine learning algorithm. Distributed implementations run under Hadoop on facility computing clouds. The probabilistic model under study is the infinite HMM Beal, Ghahramani and Rasmussen, 2002, in which parameters are learnt using an instance blocked Gibbs sampling, with a step consisting of a dynamic program. We apply this model to learn part-of-speech tags from newswire text in an unsupervised fashion. However our focus here is on runtime performance, as opposed to NLP-relevant scores, embodied by iteration duration, ease of development, deployment and debugging.


Yutian Chen

Anoop Korattikara, Yutian Chen, and Max Welling. Austerity in MCMC land: Cutting the Metropolis-Hastings budget. In 31st International Conference on Machine Learning, pages 181-189, Beijing, China, June 2014.

Abstract: Can we make Bayesian posterior MCMC sampling more efficient when faced with very large datasets? We argue that computing the likelihood for N datapoints in the Metropolis-Hastings (MH) test to reach a single binary decision is computationally inefficient. We introduce an approximate MH rule based on a sequential hypothesis test that allows us to accept or reject samples with high confidence using only a fraction of the data required for the exact MH rule. While this method introduces an asymptotic bias, we show that this bias can be controlled and is more than offset by a decrease in variance due to our ability to draw more samples per unit of time.

Comment: supplementary


John Cunningham

E. Gilboa, Yunus Saatçi, and John P. Cunningham. Scaling multidimensional Gaussian processes using projected additive approximations. In 30th International Conference on Machine Learning, 2013.

Abstract: Exact Gaussian Process (GP) regression has O(N3) runtime for data size N, making it intractable for large N. Many algorithms for improving GP scaling approximate the covariance with lower rank matrices. Other work has exploited structure inherent in particular covariance functions, including GPs with implied Markov structure, and equispaced inputs (both enable O(N) runtime). However, these GP advances have not been extended to the multidimensional input setting, despite the preponderance of multidimensional applications. This paper introduces and tests novel extensions of structured GPs to multidimensional inputs. We present new methods for additive GPs, showing a novel connection between the classic backfitting method and the Bayesian framework. To achieve optimal accuracy-complexity tradeoff, we extend this model with a novel variant of projection pursuit regression. Our primary result – projection pursuit Gaussian Process Regression – shows orders of magnitude speedup while preserving high accuracy. The natural second and third steps include non-Gaussian observations and higher dimensional equispaced grid methods. We introduce novel techniques to address both of these necessary directions. We thoroughly illustrate the power of these three advances on several datasets, achieving close performance to the naive Full GP at orders of magnitude less cost.

E. Gilboa, Yunus Saatçi, and John P. Cunningham. Scaling multidimensional inference for structured Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013.

Comment: to appear

Andrew Gordon Wilson, Elad Gilboa, Arye Nehorai, and John P Cunningham. Gpatt: Fast multidimensional pattern extrapolation with Gaussian processes. arXiv preprint arXiv:1310.5288, 2013.

Abstract: Gaussian processes are typically used for smoothing and interpolation on small datasets. We introduce a new Bayesian nonparametric framework - GPatt - enabling automatic pattern extrapolation with Gaussian processes on large multidimensional datasets. GPatt unifies and extends highly expressive kernels and fast exact inference techniques. Without human intervention - no hand crafting of kernel features, and no sophisticated initialisation procedures - we show that GPatt can solve large scale pattern extrapolation, inpainting, and kernel discovery problems, including a problem with 383,400 training points. We find that GPatt significantly outperforms popular alternative scalable Gaussian process methods in speed and accuracy. Moreover, we discover profound differences between each of these methods, suggesting expressive kernels, nonparametric representations, and scalable inference which exploits model structure are useful in combination for modelling large scale multidimensional patterns.

John P. Cunningham, Zoubin Ghahramani, and Carl E. Rasmussen. Gaussian Processes for time-marked time-series data. In 15th International Conference on Artificial Intelligence and Statistics, 2012.

Abstract: In many settings, data is collected as multiple time series, where each recorded time series is an observation of some underlying dynamical process of interest. These observations are often time-marked with known event times, and one desires to do a range of standard analyses. When there is only one time marker, one simply aligns the observations temporally on that marker. When multiple time-markers are present and are at different times on different time series observations, these analyses are more difficult. We describe a Gaussian Process model for analyzing multiple time series with multiple time markings, and we test it on a variety of data.

J. H. Macke, L. Busing, J. P. Cunningham, B. M. Yu, K. V. Shenoy, and M. Sahani. Empirical models of spiking in neural populations. In Advances in Neural Information Processing Systems 25, pages 1-8, Granada, Spain, December 2011.

Abstract: Neurons in the neocortex code and compute as part of a locally interconnected population. Large-scale multi-electrode recording makes it possible to access these population processes empirically by fitting statistical models to unaveraged data. What statistical structure best describes the concurrent spiking of cells within a local network? We argue that in the cortex, where firing exhibits extensive correlations in both time and space and where a typical sample of neurons still reflects only a very small fraction of the local population, the most appropriate model captures shared variability by a low-dimensional latent process evolving with smooth dynamics, rather than by putative direct coupling. We test this claim by comparing a latent dynamical model with realistic spiking observations to coupled generalised linear spike-response models (GLMs) using cortical recordings. We find that the latent dynamical approach outperforms the GLM in terms of goodness-of- fit, and reproduces the temporal correlations in the data more accurately. We also compare models whose observations models are either derived from a Gaussian or point-process models, finding that the non-Gaussian model provides slightly better goodness-of-fit and more realistic population spike counts.

B. Petreska, B. M. Yu, J. P. Cunningham, G. Santhanam, S. I. Ryu, K. V. Shenoy, and M. Sahani. Dynamical segmentation of single trials from population neural data. In Advances in Neural Information Processing Systems 25, pages 1-8, Granada, Spain, December 2011.

Abstract: Simultaneous recordings of many neurons embedded within a recurrently-connected cortical network may provide concurrent views into the dynamical processes of that network, and thus its computational function. In principle, these dynamics might be identified by purely unsupervised, statistical means. Here, we show that a Hidden Switching Linear Dynamical Systems (HSLDS) model - in which multiple linear dynamical laws approximate and nonlinear and potentially non-stationary dynamical process - is able to distinguish dynamical regimes within single-trial motor cortical activity associated with the preparation and initiation of hand movements. The regimes are identified without reference to behavioural or experimental epochs, but nonetheless transitions between them correlate strongly with external events whose timing may vary from trial to trial. The HSLDS model also performs better than recent comparable models in predicting the firing rate of an isolated neuron based on the firing rates of others, suggesting that it captures more of the "Shared variance" of the data. Thus, the method is able to trace the dynamical processes underlying the coordinated evolution of network activity in a way that appears to reflect its computational role.

J. P. Cunningham, P. Nuyujukian, V. Gilja, C. A. Chestek, S. I. Ryu, and K. V. Shenoy. A closed-loop human simulator for investigating the role of feedback-control in brain-machine interfaces. Journal of Neurophysiology, 105:1932-1949, 2011.

Abstract: Neural prosthetic systems seek to improve the lives of severely disabled people by decoding neural activity into useful behavioral commands. These systems and their decoding algorithms are typically developed "offline", using neural activity previously gathered from a healthy animal, and the decoded movement is then compared with the true movement that accompanied the recorded neural activity. However, this offline design and testing may neglect important features of a real prosthesis, most notably the critical role of feedback control, which enables the user to adjust neural activity while using the prosthesis. We hypothesize that under- standing and optimally designing high-performance decoders require an experimental platform where humans are in closed-loop with the various candidate decode systems and algorithms. It remains unexplored the extent to which the subject can, for a particular decode system, algorithm, or parameter, engage feedback and other strategies to improve decode performance. Closed-loop testing may suggest different choices than offline analyses. Here we ask if a healthy human subject, using a closed-loop neural prosthesis driven by synthetic neural activity, can inform system design. We use this online pros- thesis simulator (OPS) to optimize "online" decode performance based on a key parameter of a current state-of-the-art decode algorithm, the bin width of a Kalman filter. First, we show that offline and online analyses indeed suggest different parameter choices. Previous literature and our offline analyses agree that neural activity should be analyzed in bins of 100- to 300-ms width. OPS analysis, which incorporates feedback control, suggests that much shorter bin widths (25-50 ms) yield higher decode performance. Second, we confirm this surprising finding using a closed-loop rhesus monkey prosthetic system. These findings illustrate the type of discovery made possible by the OPS, and so we hypothesize that this novel testing approach will help in the design of prosthetic systems that will translate well to human patients.

M. Zhao, A. P. Batista, J. P. Cunningham, C. A. Chestek, Z. Rivera-Alvidrez, R. Kalmar, S. I. Ryu, K. V. Shenoy, and S. Iyengar. An L1-regularized logistic model for detecting short-term neuronal interactions.. Journal of Computational Neuroscience, 2011, doi 10.1007/s10827-011-0365-5. In Press.

Abstract: Interactions among neurons are a key com- ponent of neural signal processing. Rich neural data sets potentially containing evidence of interactions can now be collected readily in the laboratory, but existing analysis methods are often not sufficiently sensitive and specific to reveal these interactions. Generalized linear models offer a platform for analyzing multi-electrode recordings of neuronal spike train data. Here we suggest an L1-regularized logistic regression model (L1L method) to detect short-term (order of 3ms) neuronal interactions. We estimate the parameters in this model using a coordinate descent algorithm, and determine the optimal tuning parameter using a Bayesian Information Criterion. Simulation studies show that in general the L1L method has better sensitivities and specificities than those of the traditional shuffle-corrected cross-correlogram (covariogram) method. The L1L method is able to detect excitatory interactions with both high sensitivity and specificity with reasonably large recordings, even when the magnitude of the interactions is small; similar results hold for inhibition given sufficiently high baseline firing rates. Our study also suggests that the false positives can be further removed by thresholding, because their magnitudes are typically smaller than true interactions. Simulations also show that the L1L method is somewhat robust to partially observed networks. We apply the method to multi-electrode recordings collected in the monkey dorsal premotor cortex (PMd) while the animal prepares to make reaching arm movements. The results show that some neurons interact differently depending on task conditions. The stronger interactions detected with our L1L method were also visible using the covariogram method.

M. M. Churchland, J. P. Cunningham, M. T. Kaufman, S. I. Ryu, and K. V. Shenoy. Cortical preparatory activity: Representation of movement or first cog in a dynamical machine?. Neuron, 68:387-400, 2010.

Abstract: The motor cortices are active during both movement and movement preparation. A common assumption is that preparatory activity constitutes a subthreshold form of movement activity: a neuron active during rightward movements becomes modestly active during preparation of a rightward movement. We asked whether this pattern of activity is, in fact, observed. We found that it was not: at the level of a single neuron, preparatory tuning was weakly correlated with movement-period tuning. Yet, somewhat paradoxically, preparatory tuning could be captured by a preferred direction in an abstract "space" that described the population-level pattern of movement activity. In fact, this relationship accounted for preparatory responses better than did traditional tuning models. These results are expected if preparatory activity provides the initial state of a dynamical system whose evolution produces movement activity. Our results thus suggest that preparatory activity may not represent specific factors, and may instead play a more mechanistic role.

M. M. Churchland, B. M. Yu, J. P. Cunningham, L. P. Sugrue, M. R. Cohen, G. S. Corrado, W. T. Newsome, A. M. Clark, P. Hosseini, B. B. Scott, D. C. Bradley, M. A. Smith, A. Kohn, J. A. Movshon, K. M. Armstrong, T. Moore, S. W. Chang, L. H. Snyder, S. G. Lisberger, N. J. Priebe, I. M. Finn, D. Ferster, S. I. Ryu, G. Santhanam, M. Sahani, and K. V. Shenoy. Stimulus onset quashes neural variability: a widespread cortical phenomenon. Nature Neuroscience, 13:369-378, 2010.

Abstract: Neural responses are typically characterized by computing the mean firing rate, but response variability can exist across trials. Many studies have examined the effect of a stimulus on the mean response, but few have examined the effect on response variability. We measured neural variability in 13 extracellularly recorded datasets and one intracellularly recorded dataset from seven areas spanning the four cortical lobes in monkeys and cats. In every case, stimulus onset caused a decline in neural variability. This occurred even when the stimulus produced little change in mean firing rate. The variability decline was observed in membrane potential recordings, in the spiking of individual neurons and in correlated spiking variability measured with implanted 96-electrode arrays. The variability decline was observed for all stimuli tested, regardless of whether the animal was awake, behaving or anaesthetized. This widespread variability decline suggests a rather general property of cortex, that its state is stabilized by an input.

B. M. Yu, J. P. Cunningham, G. Santhanam, S. I. Ryu, K. V. Shenoy, and M. Sahani. Gaussian-process factor analysis for low-dimensional single-trial analysis of neural population activity. In Advances in Neural Information Processing Systems 21, pages 1-8, Vancouver, BC, December 2009.

Abstract: We consider the problem of extracting smooth, low-dimensional neural trajectories that summarize the activity recorded simultaneously from many neurons on individual experimental trials. Beyond the benefit of visualizing the high-dimensional, noisy spiking activity in a compact form, such trajectories can offer insight into the dynamics of the neural circuitry underlying the recorded activity. Current methods for extracting neural trajectories involve a two-stage process: the spike trains are first smoothed over time, then a static dimensionality- reduction technique is applied. We first describe extensions of the two-stage methods that allow the degree of smoothing to be chosen in a principled way and that account for spiking variability, which may vary both across neurons and across time. We then present a novel method for extracting neural trajectories - Gaussian-process factor analysis (GPFA) - which unifies the smoothing and dimensionality- reduction operations in a common probabilistic framework. We applied these methods to the activity of 61 neurons recorded simultaneously in macaque premotor and motor cortices during reach planning and execution. By adopting a goodness-of-fit metric that measures how well the activity of each neuron can be predicted by all other recorded neurons, we found that the proposed extensions improved the predictive ability of the two-stage methods. The predictive ability was further improved by going to GPFA. From the extracted trajectories, we directly observed a convergence in neural state during motor planning, an effect that was shown indirectly by previous studies. We then show how such methods can be a powerful tool for relating the spiking activity across a neural population to the subject's behavior on a single-trial basis. Finally, to assess how well the proposed methods characterize neural population activity when the underlying time course is known, we performed simulations that revealed that GPFA performed tens of percent better than the best two-stage method.

C. Chang, J. P. Cunningham, and G. Glover. Influence of heart rate on the bold signal: the cardiac response function. NeuroImage, 44:857-869, 2009.

Abstract: It has previously been shown that low-frequency fluctuations in both respiratory volume and cardiac rate can induce changes in the blood-oxygen level dependent (BOLD) signal. Such physiological noise can obscure the detection of neural activation using fMRI, and it is therefore important to model and remove the effects of this noise. While a hemodynamic response function relating respiratory variation (RV) and the BOLD signal has been described, no such mapping for heart rate (HR) has been proposed. In the current study, the effects of RV and HR are simultaneously deconvolved from resting state fMRI. It is demonstrated that a convolution model including RV and HR can explain significantly more variance in gray matter BOLD signal than a model that includes RV alone, and an average HR response function is proposed that well characterizes our subject population. It is observed that the voxel-wise morphology of the deconvolved RV responses is preserved when HR is included in the model, and that its form is adequately modeled by Birn et al.'s previously described respiration response function. Furthermore, it is shown that modeling out RV and HR can significantly alter functional connectivity maps of the default-mode network.

B. M. Yu, J. P. Cunningham, G. Santhanam, S. I. Ryu, K. V. Shenoy, and M. Sahani. Gaussian-process factor analysis for low-dimensional single-trial analysis of neural population activity. Journal of Neurophysiology, 102:614-635, 2009.

Abstract: We consider the problem of extracting smooth, low-dimensional neural trajectories that summarize the activity recorded simultaneously from many neurons on individual experimental trials. Beyond the benefit of visualizing the high-dimensional, noisy spiking activity in a compact form, such trajectories can offer insight into the dynamics of the neural circuitry underlying the recorded activity. Current methods for extracting neural trajectories involve a two-stage process: the spike trains are first smoothed over time, then a static dimensionality- reduction technique is applied. We first describe extensions of the two-stage methods that allow the degree of smoothing to be chosen in a principled way and that account for spiking variability, which may vary both across neurons and across time. We then present a novel method for extracting neural trajectories - Gaussian-process factor analysis (GPFA) - which unifies the smoothing and dimensionality- reduction operations in a common probabilistic framework. We applied these methods to the activity of 61 neurons recorded simultaneously in macaque premotor and motor cortices during reach planning and execution. By adopting a goodness-of-fit metric that measures how well the activity of each neuron can be predicted by all other recorded neurons, we found that the proposed extensions improved the predictive ability of the two-stage methods. The predictive ability was further improved by going to GPFA. From the extracted trajectories, we directly observed a convergence in neural state during motor planning, an effect that was shown indirectly by previous studies. We then show how such methods can be a powerful tool for relating the spiking activity across a neural population to the subject's behavior on a single-trial basis. Finally, to assess how well the proposed methods characterize neural population activity when the underlying time course is known, we performed simulations that revealed that GPFA performed tens of percent better than the best two-stage method.

J. P. Cunningham, B. M. Yu, K. V. Shenoy, and M. Sahani. Inferring neural firing rates from spike trains using Gaussian processes. In Advances in Neural Information Processing Systems 20, pages 1-8, Vancouver, BC, December 2008.

Abstract: Neural spike trains present challenges to analytical efforts due to their noisy, spiking nature. Many studies of neuroscientific and neural prosthetic importance rely on a smoothed, denoised estimate of the spike train's underlying firing rate. Current techniques to find time-varying firing rates require ad hoc choices of parameters, offer no confidence intervals on their estimates, and can obscure potentially important single trial variability. We present a new method, based on a Gaussian Process prior, for inferring probabilistically optimal estimates of firing rate functions underlying single or multiple neural spike trains. We test the performance of the method on simulated data and experimentally gathered neural spike trains, and we demonstrate improvements over conventional estimators.

Comment: Spotlight Presentation

J. P. Cunningham, K. V. Shenoy, and M. Sahani. Fast Gaussian process methods for point process intensity estimation. In 25th International Conference on Machine Learning, pages 1-8, Helsinki, Finland, June 2008.

Abstract: Point processes are difficult to analyze because they provide only a sparse and noisy observation of the intensity function driving the process. Gaussian Processes offer an attractive framework within which to infer underlying intensity functions. The result of this inference is a continuous function defined across time that is typically more amenable to analytical efforts. However, a naive implementation will become computationally infeasible in any problem of reasonable size, both in memory and run time requirements. We demonstrate problem specific methods for a class of renewal processes that eliminate the memory burden and reduce the solve time by orders of magnitude.

J. P. Cunningham. Derivation of Expectation Propagation for "fast Gaussian process methods for point process intensity estimation". Technical report, Stanford University, 2008.

Abstract: We derive the Expectation Propagation algorithm updates for approximating the posterior distribution on intensity in a conditionally inhomogeneous gamma interval process with a Gaussian Process prior (GP IGIP), a model which appeared in Cunningham, Shenoy, Sahani (2008) ICML.


Alex Davies

Alex Davies and Zoubin Ghahramani. The random forest kernel and other kernels for big data from random partitions. arXiv, abs/1402.4293, 2014.

Abstract: We present Random Partition Kernels, a new class of kernels derived by demonstrating a natural connection between random partitions of objects and kernels between those objects. We show how the construction can be used to create kernels from methods that would not normally be viewed as random partitions, such as Random Forest. To demonstrate the potential of this method, we propose two new kernels, the Random Forest Kernel and the Fast Cluster Kernel, and show that these kernels consistently outperform standard kernels on problems involving real-world datasets. Finally, we show how the form of these kernels lend themselves to a natural approximation that is appropriate for certain big data problems, allowing O(N) inference in methods such as Gaussian Processes, Support Vector Machines and Kernel PCA.

Simon Lacoste-Julien, Konstantina Palla, Alex Davies, Gjergji Kasneci, Thore Graepel, and Zoubin Ghahramani. Sigma: simple greedy matching for aligning large knowledge bases. In KDD, pages 572-580. ACM, 2013.

Abstract: The Internet has enabled the creation of a growing number of large-scale knowledge bases in a variety of domains containing complementary information. Tools for automatically aligning these knowledge bases would make it possible to unify many sources of structured knowledge and answer complex queries. However, the efficient alignment of large-scale knowledge bases still poses a considerable challenge. Here, we present Simple Greedy Matching (SiGMa), a simple algorithm for aligning knowledge bases with millions of entities and facts. SiGMa is an iterative propagation algorithm which leverages both the structural information from the relationship graph as well as flexible similarity measures between entity properties in a greedy local search, thus making it scalable. Despite its greedy nature, our experiments indicate that SiGMa can efficiently match some of the world's largest knowledge bases with high precision. We provide additional experiments on benchmark datasets which demonstrate that SiGMa can outperform state-of-the-art approaches both in accuracy and efficiency.

A. Davies and Z. Ghahramani. Language-independent Bayesian sentiment mining of twitter. In In The Fifth Workshop on Social Network Mining and Analysis (SNA-KDD 2011), August 2011.

Abstract: This paper outlines a new language-independent model for sentiment analysis of short, social-network statuses. We demonstrate this on data from Twitter, modelling happy vs sad sentiment, and show that in some circumstances this outperforms similar Naive Bayes models by more than 10%. We also propose an extension to allow the modelling of differ- ent sentiment distributions in different geographic regions, while incorporating information from neighbouring regions. We outline the considerations when creating a system analysing Twitter data and present a scalable system of data acquisi- tion and prediction that can monitor the sentiment of tweets in real time.


David Duvenaud

James Robert Lloyd, David Duvenaud, Roger Grosse, Joshua B. Tenenbaum, and Zoubin Ghahramani. Automatic construction and natural-language description of nonparametric regression models. In Association for the Advancement of Artificial Intelligence (AAAI), July 2014.

Abstract: This paper presents the beginnings of an automatic statistician, focusing on regression problems. Our system explores an open-ended space of statistical models to discover a good explanation of a data set, and then produces a detailed report with figures and natural-language text. Our approach treats unknown regression functions nonparametrically using Gaussian processes, which has two important consequences. First, Gaussian processes can model functions in terms of high-level properties (e.g. smoothness, trends, periodicity, changepoints). Taken together with the compositional structure of our language of models this allows us to automatically describe functions in simple terms. Second, the use of flexible nonparametric models and a rich language for composing them in an open-ended manner also results in state-of-the-art extrapolation performance evaluated over 13 real time series data sets from various domains.

Michael Schober, David Duvenaud, and Philipp Hennig. Probabilistic ODE solvers with Runge-Kutta means. arXiv preprint arXiv:1406.2582, June 2014.

Abstract: Runge-Kutta methods are the classic family of solvers for ordinary differential equations (ODEs), and the basis for the state-of-the-art. Like most numerical methods, they return point estimates. We construct a family of probabilistic numerical methods that instead return a Gauss-Markov process defining a probability distribution over the ODE solution. In contrast to prior work, we construct this family such that posterior means match the outputs of the Runge-Kutta family exactly, thus inheriting their proven good properties. Remaining degrees of freedom not identified by the match to Runge-Kutta are chosen such that the posterior probability measure fits the observed structure of the ODE. Our results shed light on the structure of Runge-Kutta solvers from a new direction, provide a richer, probabilistic output, have low computational cost, and raise new research questions.

David Duvenaud, Oren Rippel, Ryan P. Adams, and Zoubin Ghahramani. Avoiding pathologies in very deep networks. In 17th International Conference on Artificial Intelligence and Statistics, Reykjavik, Iceland, April 2014.

Abstract: Choosing appropriate architectures and regularization strategies for deep networks is crucial to good predictive performance. To shed light on this problem, we analyze the analogous problem of constructing useful priors on compositions of functions. Specifically, we study the deep Gaussian process, a type of infinitely-wide, deep neural network. We show that in standard architectures, the representational capacity of the network tends to capture fewer degrees of freedom as the number of layers increases, retaining only a single degree of freedom in the limit. We propose an alternate network architecture which does not suffer from this pathology. We also examine deep covariance functions, obtained by composing infinitely many feature transforms. Lastly, we characterize the class of models obtained by performing dropout on Gaussian processes.

Tomoharu Iwata, David Duvenaud, and Zoubin Ghahramani. Warped mixtures for nonparametric cluster shapes. In 29th Conference on Uncertainty in Artificial Intelligence, Bellevue, Washington, July 2013.

Abstract: A mixture of Gaussians fit to a single curved or heavy-tailed cluster will report that the data contains many clusters. To produce more appropriate clusterings, we introduce a model which warps a latent mixture of Gaussians to produce nonparametric cluster shapes. The possibly low-dimensional latent mixture model allows us to summarize the properties of the high-dimensional clusters (or density manifolds) describing the data. The number of manifolds, as well as the shape and dimension of each manifold is automatically inferred. We derive a simple inference scheme for this model which analytically integrates out both the mixture parameters and the warping function. We show that our model is effective for density estimation, performs better than infinite Gaussian mixture models at recovering the true number of clusters, and produces interpretable summaries of high-dimensional datasets.

David Duvenaud, James Robert Lloyd, Roger Grosse, Joshua B. Tenenbaum, and Zoubin Ghahramani. Structure discovery in nonparametric regression through compositional kernel search. In 30th International Conference on Machine Learning, Atlanta, Georgia, USA, June 2013.

Abstract: Despite its importance, choosing the structural form of the kernel in nonparametric regression remains a black art. We define a space of kernel structures which are built compositionally by adding and multiplying a small number of base kernels. We present a method for searching over this space of structures which mirrors the scientific discovery process. The learned structures can often decompose functions into interpretable components and enable long-range extrapolation on time-series datasets. Our structure search method outperforms many widely used kernels and kernel combination methods on a variety of prediction tasks.

Michael A. Osborne, David Duvenaud, Roman Garnett, Carl Edward Rasmussen, Stephen J. Roberts, and Zoubin Ghahramani. Active learning of model evidence using Bayesian quadrature. In Advances in Neural Information Processing Systems 25, pages 46-54, Lake Tahoe, California, USA, December 2012.

Abstract: Numerical integration is a key component of many problems in scientific computing, statistical modelling, and machine learning. Bayesian Quadrature is a model-based method for numerical integration which, relative to standard Monte Carlo methods, offers increased sample efficiency and a more robust estimate of the uncertainty in the estimated integral. We propose a novel Bayesian Quadrature approach for numerical integration when the integrand is non-negative, such as the case of computing the marginal likelihood, predictive distribution, or normalising constant of a probabilistic model. Our approach approximately marginalises the quadrature model’s hyperparameters in closed form, and introduces an active learning scheme to optimally select function evaluations, as opposed to using Monte Carlo samples. We demonstrate our method on both a number of synthetic benchmarks and a real scientific problem from astronomy.

Ferenc Huszár and David Duvenaud. Optimally-weighted herding is Bayesian quadrature. In 28th Conference on Uncertainty in Artificial Intelligence, pages 377-385, Catalina Island, California, July 2012.

Abstract: Herding and kernel herding are deterministic methods of choosing samples which summarise a probability distribution. A related task is choosing samples for estimating integrals using Bayesian quadrature. We show that the criterion minimised when selecting samples in kernel herding is equivalent to the posterior variance in Bayesian quadrature. We then show that sequential Bayesian quadrature can be viewed as a weighted version of kernel herding which achieves performance superior to any other weighted herding method. We demonstrate empirically a rate of convergence faster than O(1/N). Our results also imply an upper bound on the empirical error of the Bayesian quadrature estimate.

David Duvenaud, Hannes Nickisch, and Carl Edward Rasmussen. Additive Gaussian processes. In Advances in Neural Information Processing Systems 24, pages 226-234, Granada, Spain, December 2011.

Abstract: We introduce a Gaussian process model of functions which are additive. An additive function is one which decomposes into a sum of low-dimensional functions, each depending on only a subset of the input variables. Additive GPs generalize both Generalized Additive Models, and the standard GP models which use squared-exponential kernels. Hyperparameter learning in this model can be seen as Bayesian Hierarchical Kernel Learning (HKL). We introduce an expressive but tractable parameterization of the kernel function, which allows efficient evaluation of all input interaction terms, whose number is exponential in the input dimension. The additional structure discoverable by this model results in increased interpretability, as well as state-of-the-art predictive power in regression tasks.


Frederik Eaton

Frederik Eaton and Zoubin Ghahramani. Model reductions for inference: Generality of pairwise, binary, and planar factor graphs. Neural Computation, 25(5):1213-1260, 2013.

Abstract: We offer a solution to the problem of efficiently translating algorithms between different types of discrete statistical model. We investigate the expressive power of three classes of model-those with binary variables, with pairwise factors, and with planar topology-as well as their four intersections. We formalize a notion of "simple reduction" for the problem of inferring marginal probabilities and consider whether it is possible to "simply reduce" marginal inference from general discrete factor graphs to factor graphs in each of these seven subclasses. We characterize the reducibility of each class, showing in particular that the class of binary pairwise factor graphs is able to simply reduce only positive models. We also exhibit a continuous "spectral reduction" based on polynomial interpolation, which overcomes this limitation. Experiments assess the performance of standard approximate inference algorithms on the outputs of our reductions.

Frederik Eaton and Zoubin Ghahramani. Choosing a variable to clamp: Approximate inference using conditioned belief propagation. In D. van Dyk and M. Welling, editors, 12th International Conference on Artificial Intelligence and Statistics, volume 5, pages 145-152, Clearwater Beach, FL, USA, April 2009. Journal of Machine Learning Research.

Abstract: In this paper we propose an algorithm for approximate inference on graphical models based on belief propagation (BP). Our algorithm is an approximate version of Cutset Conditioning, in which a subset of variables is instantiated to make the rest of the graph singly connected. We relax the constraint of single-connectedness, and select variables one at a time for conditioning, running belief propagation after each selection. We consider the problem of determining the best variable to clamp at each level of recursion, and propose a fast heuristic which applies back-propagation to the BP updates. We demonstrate that the heuristic performs better than selecting variables at random, and give experimental results which show that it performs competitively with existing approximate inference algorithms.

Comment: Code (in C++ based on libDAI).


Jes Frellsen

Jes Frellsen, Thomas Hamelryck, and Jesper Ferkinghoff-Borg. Combining the multicanonical ensemble with generative probabilistic models of local biomolecular structure. In Proceedings of the 59th World Statistics Congress of the International Statistical Institute, pages 139-144, Hong Kong, 2014.

Abstract: Markov chain Monte Carlo is a powerful tool for sampling complex systems such as large biomolecular structures. However, the standard Metropolis-Hastings algorithm suffers from a number of deficiencies when applied to systems with rugged free-energy landscapes. Some of these deficiencies can be addressed with the multicanonical ensemble. In this paper we will present two strategies for applying the multicanonical ensemble to distributions constructed from generative probabilistic models of local biomolecular structure. In particular, we will describe how to use the multicanonical ensemble efficiently in conjunction with the reference ratio method.

Peter Kerpedjiev, Jes Frellsen, Stinus Lindgreen, and Anders Krogh. Adaptable probabilistic mapping of short reads using position specific scoring matrices. BMC bioinformatics, 15:100, 2014, doi 10.1186/1471-2105-15-100.

Abstract: BACKGROUND: Modern DNA sequencing methods produce vast amounts of data that often requires mapping to a reference genome. Most existing programs use the number of mismatches between the read and the genome as a measure of quality. This approach is without a statistical foundation and can for some data types result in many wrongly mapped reads. Here we present a probabilistic mapping method based on position-specific scoring matrices, which can take into account not only the quality scores of the reads but also user-specified models of evolution and data-specific biases.RESULTS:We show how evolution, data-specific biases, and sequencing errors are naturally dealt with probabilistically. Our method achieves better results than Bowtie and BWA on simulated and real ancient and PAR-CLIP reads, as well as on simulated reads from the AT rich organism P. falciparum, when modeling the biases of these data. For simulated Illumina reads, the method has consistently higher sensitivity for both single-end and paired-end data. We also show that our probabilistic approach can limit the problem of random matches from short reads of contamination and that it improves the mapping of real reads from one organism (D. melanogaster) to a related genome (D. simulans). CONCLUSION: The presented work is an implementation of a novel approach to short read mapping where quality scores, prior mismatch probabilities and mapping qualities are handled in a statistically sound manner. The resulting implementation provides not only a tool for biologists working with low quality and/or biased sequencing data but also a demonstration of the feasibility of using a probability based alignment method on real and simulated data sets.

Comment: Peter Kerpedjiev and Jes Frellsen contributed equally. Additional resources are available at bwa-pssm.binf.ku.dk

Peter Menzel, Jes Frellsen, Mireya Plass, Simon H. Rasmussen, and Anders Krogh. On the accuracy of short read mapping. In Deep Sequencing Data Analysis, volume 1038, pages 39-59. Springer, 2013.

Abstract: The development of high-throughput sequencing technologies has revolutionized the way we study genomes and gene regulation. In a single experiment, millions of reads are produced. To gain knowledge from these experiments the first thing to be done is finding the genomic origin of the reads, i.e., mapping the reads to a reference genome. In this new situation, conventional alignment tools are obsolete, as they cannot handle this huge amount of data in a reasonable amount of time. Thus, new mapping algorithms have been developed, which are fast at the expense of a small decrease in accuracy. In this chapter we discuss the current problems in short read mapping and show that mapping reads correctly is a nontrivial task. Through simple experiments with both real and synthetic data, we demonstrate that different mappers can give different results depending on the type of data, and that a considerable fraction of uniquely mapped reads is potentially mapped to an incorrect location. Furthermore, we provide simple statistical results on the expected number of random matches in a genome (E-value) and the probability of a random match as a function of read length. Finally, we show that quality scores contain valuable information for mapping and why mapping quality should be evaluated in a probabilistic manner. In the end, we discuss the potential of improving the performance of current methods by considering these quality scores in a probabilistic mapping program.

Comment: Peter Menzel and Jes Frellsen contributed equally.


Roger Frigola

Roger Frigola, Fredrik Lindsten, Thomas B. Schön, and Carl Edward Rasmussen. Identification of Gaussian process state-space models with particle stochastic approximation EM. In Proceedings of the 19th World Congress of the International Federation of Automatic Control (IFAC), 2014.

Abstract: Gaussian process state-space models (GP-SSMs) are a very flexible family of models of nonlinear dynamical systems. They comprise a Bayesian nonparametric representation of the dynamics of the system and additional (hyper-)parameters governing the properties of this nonparametric representation. The Bayesian formalism enables systematic reasoning about the uncertainty in the system dynamics. We present an approach to maximum likelihood identification of the parameters in GP-SSMs, while retaining the full nonparametric description of the dynamics. The method is based on a stochastic approximation version of the EM algorithm that employs recent developments in particle Markov chain Monte Carlo for efficient identification.

Roger Frigola, Fredrik Lindsten, Thomas B. Schön, and Carl Edward Rasmussen. Bayesian inference and learning in Gaussian process state-space models with particle MCMC. In L. Bottou, C.J.C. Burges, Z. Ghahramani, M. Welling, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 26. Curran Associates, Inc., 2013.

Abstract: State-space models are successfully used in many areas of science, engineering and economics to model time series and dynamical systems. We present a fully Bayesian approach to inference and learning in nonlinear nonparametric state-space models. We place a Gaussian process prior over the transition dynamics, resulting in a flexible model able to capture complex dynamical phenomena. However, to enable efficient inference, we marginalize over the dynamics of the model and instead infer directly the joint smoothing distribution through the use of specially tailored Particle Markov Chain Monte Carlo samplers. Once a sample from the smoothing distribution is computed, the state transition predictive distribution can be formulated analytically. We make use of sparse Gaussian process models to greatly reduce the computational complexity of the approach.

Roger Frigola and Carl Edward Rasmussen. Integrated pre-processing for Bayesian nonlinear system identification with Gaussian processes. In Decision and Control (CDC), 2013 IEEE 52nd Annual Conference on, 2013.

Abstract: We introduce GP-FNARX: a new model for nonlinear system identification based on a nonlinear autoregressive exogenous model (NARX) with filtered regressors (F) where the nonlinear regression problem is tackled using sparse Gaussian processes (GP). We integrate data pre-processing with system identification into a fully automated procedure that goes from raw data to an identified model. Both pre-processing parameters and GP hyper-parameters are tuned by maximizing the marginal likelihood of the probabilistic model. We obtain a Bayesian model of the system's dynamics which is able to report its uncertainty in regions where the data is scarce. The automated approach, the modeling of uncertainty and its relatively low computational cost make of GP-FNARX a good candidate for applications in robotics and adaptive control.


Jurgen Van Gael

David A. Knowles, Jurgen Van Gael, and Zoubin Ghahramani. Message passing algorithms for the Dirichlet diffusion tree. In 28th International Conference on Machine Learning, 2011.

Abstract: We demonstrate efficient approximate inference for the Dirichlet Diffusion Tree (Neal, 2003), a Bayesian nonparametric prior over tree structures. Although DDTs provide a powerful and elegant approach for modeling hierarchies they haven't seen much use to date. One problem is the computational cost of MCMC inference. We provide the first deterministic approximate inference methods for DDT models and show excellent performance compared to the MCMC alternative. We present message passing algorithms to approximate the Bayesian model evidence for a specific tree. This is used to drive sequential tree building and greedy search to find optimal tree structures, corresponding to hierarchical clusterings of the data. We demonstrate appropriate observation models for continuous and binary data. The empirical performance of our method is very close to the computationally expensive MCMC alternative on a density estimation problem, and significantly outperforms kernel density estimators.

Comment: web site

G. Kasneci, J. Van Gael, T. Graepel, and R. Herbrich. Bayesian knowledge corroboration with logical rules and user feedback. In European Conference on Machine Learning (ECML), Barcelona, Spain, September 2010.

Abstract: Current knowledge bases suffer from either low coverage or low accuracy. The underlying hypothesis of this work is that user feedback can greatly improve the quality of automatically extracted knowledge bases. The feedback could help quantify the uncertainty associated with the stored statements and would enable mechanisms for searching, ranking and reasoning at entity-relationship level. Most importantly, a principled model for exploiting user feedback to learn the truth values of statements in the knowledge base would be a major step forward in addressing the issue of knowledge base curation. We present a family of probabilistic graphical models that builds on user feedback and logical inference rules derived from the popular Semantic-Web formalism of RDFS [1]. Through internal inference and belief propagation, these models can learn both, the truth values of the statements in the knowledge base and the reliabilities of the users who give feedback. We demonstrate the viability of our approach in extensive experiments on real-world datasets, with feedback collected from Amazon Mechanical Turk.

C. Rotsos, J. Van Gael, A.W. Moore, and Z. Ghahramani. Traffic classification in information poor environments. In 1st International Workshop on Traffic Analysis and Classification (IWCMC '10), Caen, France, July 2010.

Abstract: Traffic classification using machine learning continues to be an active research area. The majority of work in this area uses off-the-shelf machine learning tools and treats them as black-box classifiers. This approach turns all the modelling complexity into a feature selection problem. In this paper, we build a problem-specific solution to the traffic classification problem by designing a custom probabilistic graphical model. Graphical models are a modular framework to design classifiers which incorporate domain-specific knowledge. More specifically, our solution introduces semi-supervised learning which means we learn from both labelled and unlabelled traffic flows. We show that our solution performs competitively compared to previous approaches while using less data and simpler features.

Sébastien Bratières, Jurgen Van Gael, Andreas Vlachos, and Zoubin Ghahramani. Scaling the iHMM: Parallelization versus Hadoop. In Proceedings of the 2010 10th IEEE International Conference on Computer and Information Technology, pages 1235-1240, Bradford, UK, 2010. IEEE Computer Society, doi 10.1109/CIT.2010.223.

Abstract: This paper compares parallel and distributed implementations of an iterative, Gibbs sampling, machine learning algorithm. Distributed implementations run under Hadoop on facility computing clouds. The probabilistic model under study is the infinite HMM Beal, Ghahramani and Rasmussen, 2002, in which parameters are learnt using an instance blocked Gibbs sampling, with a step consisting of a dynamic program. We apply this model to learn part-of-speech tags from newswire text in an unsupervised fashion. However our focus here is on runtime performance, as opposed to NLP-relevant scores, embodied by iteration duration, ease of development, deployment and debugging.

Charalampos Rotsos, Jurgen Van Gael, Andrew W. Moore, and Zoubin Ghahramani. Probabilistic graphical models for semi-supervised traffic classification. In The 6th International Wireless Communications and Mobile Computing Conference, pages 752-757, Caen, France, 2010.

Abstract: Traffic classification using machine learning continues to be an active research area. The majority of work in this area uses off-the-shelf machine learning tools and treats them as black-box classifiers. This approach turns all the modelling complexity into a feature selection problem. In this paper, we build a problem-specific solution to the traffic classification problem by designing a custom probabilistic graphical model. Graphical models are a modular framework to design classifiers which incorporate domain-specific knowledge. More specifically, our solution introduces semi-supervised learning which means we learn from both labelled and unlabelled traffic flows. We show that our solution performs competitively compared to previous approaches while using less data and simpler features.

J. Van Gael, A. Vlachos, and Z. Ghahramani. The infinite HMM for unsupervised PoS tagging. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 678-687, Singapore, August 2009. Association for Computational Linguistics.

Abstract: We extend previous work on fully unsupervised part-of-speech tagging. Using a non-parametric version of the HMM, called the infinite HMM (iHMM), we address the problem of choosing the number of hidden states in unsupervised Markov models for PoS tagging. We experiment with two non-parametric priors, the Dirichlet and Pitman-Yor processes, on the Wall Street Journal dataset using a parallelized implementation of an iHMM inference algorithm. We evaluate the results with a variety of clustering evaluation metrics and achieve equivalent or better performances than previously reported. Building on this promising result we evaluate the output of the unsupervised PoS tagger as a direct replacement for the output of a fully supervised PoS tagger for the task of shallow parsing and compare the two evaluations.

F. Doshi-Velez, K.T. Miller, J. Van Gael, and Y.W. Teh. Variational inference for the Indian buffet process. In 12th International Conference on Artificial Intelligence and Statistics, volume 12, pages 137-144, Clearwater Beach, FL, USA, April 2009. Journal of Machine Learning Research.

Abstract: The Indian Buffet Process (IBP) is a nonparametric prior for latent feature models in which observations are influenced by a combination of hidden features. For example, images may be composed of several objects and sounds may consist of several notes. Latent feature models seek to infer these unobserved features from a set of observations; the IBP provides a principled prior in situations where the number of hidden features is unknown. Current inference methods for the IBP have all relied on sampling. While these methods are guaranteed to be accurate in the limit, samplers for the IBP tend to mix slowly in practice. We develop a deterministic variational method for inference in the IBP based on a truncated stick-breaking approximation, provide theoretical bounds on the truncation error, and evaluate our method in several data regimes.

Finale Doshi-Velez, Kurt T. Miller, Jurgen Van Gael, and Yee Whye Teh. Variational inference for the Indian buffet process. Technical Report CBL-2009-001, University of Cambridge, Computational and Biological Learning Laboratory, Department of Engineering, April 2009.

Abstract: The Indian Buffet Process (IBP) is a nonparametric prior for latent feature models in which observations are influenced by a combination of hidden features. For example, images may be composed of several objects and sounds may consist of several notes. Latent feature models seek to infer these unobserved features from a set of observations; the IBP provides a principled prior in situations where the number of hidden features is unknown. Current inference methods for the IBP have all relied on sampling. While these methods are guaranteed to be accurate in the limit, samplers for the IBP tend to mix slowly in practice. We develop a deterministic variational method for inference in the IBP based on truncating to infinite models, provide theoretical bounds on the truncation error, and evaluate our method in several data regimes. This technical report is a longer version of Doshi-Velez et al. (2009).

J. Van Gael, Y.W. Teh, and Z. Ghahramani. The infinite factorial hidden Markov model. In D. Koller, D. Schuurmans, L. Bottou, and Y. Bengio, editors, Advances in Neural Information Processing Systems 21, volume 21, pages 1697-1704, Cambridge, MA, USA, December 2008. The MIT Press.

Abstract: The infinite factorial hidden Markov model is a non-parametric extension of the factorial hidden Markov model. Our model defines a probability distribution over an infinite number of independent binary hidden Markov chains which together produce an observable sequence of random variables. Central to our model is a new type of non-parametric prior distribution inspired by the Indian Buffet Process which we call the Indian Buffet Markov Process.

Jurgen Van Gael, Yunus Saatçi, Yee-Whye Teh, and Zoubin Ghahramani. Beam sampling for the infinite hidden Markov model. In 25th International Conference on Machine Learning, volume 25, pages 1088-1095, Helsinki, Finland, 2008. ACM.

Abstract: The infinite hidden Markov model is a non-parametric extension of the widely used hidden Markov model. Our paper introduces a new inference algorithm for the infinite Hidden Markov model called beam sampling. Beam sampling combines slice sampling, which limits the number of states considered at each time step to a finite number, with dynamic programming, which samples whole state trajectories efficiently. Our algorithm typically outperforms the Gibbs sampler and is more robust. We present applications of iHMM inference using the beam sampler on changepoint detection and text prediction problems.

A.B. Goldberg, X. Zhu, J. Van Gael, and D. Andrzejewski. Improving diversity in ranking using absorbing random walks. In Proceedings of NAACL HLT, pages 97-104, Rochester, NY, USA, April 2007.

J. Van Gael and X. Zhu. Correlation clustering for crosslingual link detection. In Manuela M. Veloso, editor, International Joint Conference on Artificial Intelligence (IJCAI), pages 1744-1749, Hyderabad, India, January 2007.

Abstract: The crosslingual link detection problem calls for identifying news articles in multiple languages that report on the same news event. This paper presents a novel approach based on constrained clustering. We discuss a general way for constrained clustering using a recent, graph-based clustering framework called correlation clustering. We introduce a correlation clustering implementation that features linear program chunking to allow processing larger datasets. We show how to apply the correlation clustering algorithm to the crosslingual link detection problem and present experimental results that show correlation clustering improves upon the hierarchical clustering approaches commonly used in link detection, and, hierarchical clustering approaches that take constraints into account.

A. B. Goldberg, D. Andrzejewski, J. Van Gael, B. Settles, X. Zhu, and M. Craven. Ranking biomedical passages for relevance and diversity: University of Wisconsin, Madison at TREC Genomics 2006. In Proceedings of the Fifteenth Text REtrieval Conference (TREC 2006), Gaithersburg, MD, USA, November 2006.

Abstract: We report on the University of Wisconsin, Madison's experience in the TREC Genomics 2006 track, which asks participants to retrieve passages from scientific articles that satisfy biologists' information needs. An emphasis is placed on returning relevant passages that discuss different aspects of the topic. Using an off-the-shelf information retrieval (IR) engine, we focused on query generation and reranking query results to encourage relevance and diversity. For query generation, we automatically identify noun phrases from the topic descriptions, and use online resources to gather synonyms as expansion terms. Our first submission uses the baseline IR engine results. We rerank the passages using a naive clustering-based approach in our second run, and we test GRASSHOPPER, a novel graph-theoretic algorithm based on absorbing random walks, in our third run. While our aspect-level results appear to compare favorably with other participants on average, our query generation techniques failed to produce adequate query results for several topics, causing our passage and document-level evaluation scores to suffer. Furthermore, we surprisingly achieved higher aspect-level scores using the initial ranking than our methods aimed specifically at promoting diversity. While this sounds discouraging, we have several ideas as to why this happened and hope to produce new methods that correct these shortcomings.


Yarin Gal

Yarin Gal and Zoubin Ghahramani. Pitfalls in the use of parallel inference for the Dirichlet process. In Proceedings of the 31th International Conference on Machine Learning (ICML-14), 2014.

Abstract: Recent work done by Lovell, Adams, and Mansingka (2012) and Williamson, Dubey, and Xing (2013) has suggested an alternative parametrisation for the Dirichlet process in order to derive non-approximate parallel MCMC inference for it – work which has been picked-up and implemented in several different fields. In this paper we show that the approach suggested is impractical due to an extremely unbalanced distribution of the data. We characterise the requirements of efficient parallel inference for the Dirichlet process and show that the proposed inference fails most of these requirements (while approximate approaches often satisfy most of them). We present both theoretical and experimental evidence, analysing the load balance for the inference and showing that it is independent of the size of the dataset and the number of nodes available in the parallel implementation. We end with suggestions of alternative paths of research for efficient non-approximate parallel inference for the Dirichlet process.

Yarin Gal and Phil Blunsom. A systematic Bayesian treatment of the IBM alignment models. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 2013.

Abstract: The dominant yet ageing IBM and HMM word alignment models underpin most popular Statistical Machine Translation implementations in use today. Though beset by the limitations of implausible independence assumptions, intractable optimisation problems, and an excess of tunable parameters, these models provide a scalable and reliable starting point for inducing translation systems. In this paper we build upon this venerable base by recasting these models in the non-parametric Bayesian framework. By replacing the categorical distributions at their core with hierarchical Pitman-Yor processes, and through the use of collapsed Gibbs sampling, we provide a more flexible formulation and sidestep the original heuristic optimisation techniques. The resulting models are highly extendible, naturally permitting the introduction of phrasal dependencies. We present extensive experimental results showing improvements in both AER and BLEU when benchmarked against Giza++, including significant improvements over IBM model 4.


Zoubin Ghahramani

Creighton Heaukulani, David A. Knowles, and Zoubin Ghahramani. Beta diffusion trees and hierarchical feature allocations. Technical report, Dept. of Engineering, University of Cambridge, August 2014.

Abstract: We define the beta diffusion tree, a random tree structure with a set of leaves that defines a collection of overlapping subsets of objects, known as a feature allocation. A generative process for the tree structure is defined in terms of particles (representing the objects) diffusing in some continuous space, analogously to the Dirichlet diffusion tree (Neal, 2003b), which defines a tree structure over partitions (i.e., non-overlapping subsets) of the objects. Unlike in the Dirichlet diffusion tree, multiple copies of a particle may exist and diffuse along multiple branches in the beta diffusion tree, and an object may therefore belong to multiple subsets of particles. We demonstrate how to build a hierarchically-clustered factor analysis model with the beta diffusion tree and how to perform inference over the random tree structures with a Markov chain Monte Carlo algorithm. We conclude with several numerical experiments on missing data problems with data sets of gene expression microarrays, international development statistics, and intranational socioeconomic measurements.

James Robert Lloyd, David Duvenaud, Roger Grosse, Joshua B. Tenenbaum, and Zoubin Ghahramani. Automatic construction and natural-language description of nonparametric regression models. In Association for the Advancement of Artificial Intelligence (AAAI), July 2014.

Abstract: This paper presents the beginnings of an automatic statistician, focusing on regression problems. Our system explores an open-ended space of statistical models to discover a good explanation of a data set, and then produces a detailed report with figures and natural-language text. Our approach treats unknown regression functions nonparametrically using Gaussian processes, which has two important consequences. First, Gaussian processes can model functions in terms of high-level properties (e.g. smoothness, trends, periodicity, changepoints). Taken together with the compositional structure of our language of models this allows us to automatically describe functions in simple terms. Second, the use of flexible nonparametric models and a rich language for composing them in an open-ended manner also results in state-of-the-art extrapolation performance evaluated over 13 real time series data sets from various domains.

Neil Houlsby, José Miguel Hernández-Lobato, and Zoubin Ghahramani. Cold-start active learning with robust ordinal matrix factorization. In 31st International Conference on Machine Learning, Beijing, China, June 2014.

Abstract: We present a new matrix factorization model for rating data and a corresponding active learning strategy to address the cold-start problem. Cold-start is one of the most challenging tasks for recommender systems: what to recommend with new users or items for which one has little or no data. An approach is to use active learning to collect the most useful initial ratings. However, the performance of active learning depends strongly upon having accurate estimates of i) the uncertainty in model parameters and ii) the intrinsic noisiness of the data. To achieve these estimates we propose a heteroskedastic Bayesian model for ordinal matrix factorization. We also present a computationally efficient framework for Bayesian active learning with this type of complex probabilistic model. This algorithm successfully distinguishes between informative and noisy data points. Our model yields state-of-the-art predictive performance and, coupled with our active learning strategy, enables us to gain useful information in the cold-start setting from the very first active sample.

Creighton Heaukulani, David A. Knowles, and Zoubin Ghahramani. Beta diffusion trees. In 31st International Conference on Machine Learning, Beijing, China, June 2014.

Abstract: We define the beta diffusion tree, a random tree structure with a set of leaves that defines a collection of overlapping subsets of objects, known as a feature allocation. The generative process for the tree is defined in terms of particles (representing the objects) diffusing in some continuous space, analogously to the Dirichlet and Pitman–Yor diffusion trees (Neal, 2003b; Knowles & Ghahramani, 2011), both of which define tree structures over clusters of the particles. With the beta diffusion tree, however, multiple copies of a particle may exist and diffuse to multiple locations in the continuous space, resulting in (a random number of) possibly overlapping clusters of the objects. We demonstrate how to build a hierarchically-clustered factor analysis model with the beta diffusion tree and how to perform inference over the random tree structures with a Markov chain Monte Carlo algorithm. We conclude with several numerical experiments on missing data problems with data sets of gene expression arrays, international development statistics, and intranational socioeconomic measurements.

José Miguel Hernández-Lobato, Neil Houlsby, and Zoubin Ghahramani. Probabilistic matrix factorization with non-random missing data. In 31st International Conference on Machine Learning, Beijing, China, June 2014.

Abstract: We propose a probabilistic matrix factorization model for collaborative filtering that learns from data that is missing not at random (MNAR). Matrix factorization models exhibit state-of-the-art predictive performance in collaborative filtering. However, these models usually assume that the data is missing at random (MAR), and this is rarely the case. For example, the data is not MAR if users rate items they like more than ones they dislike. When the MAR assumption is incorrect, inferences are biased and predictive performance can suffer. Therefore, we model both the generative process for the data and the missing data mechanism. By learning these two models jointly we obtain improved performance over state-of-the-art methods when predicting the ratings and when modeling the data observation process. We present the first viable MF model for MNAR data. Our results are promising and we expect that further research on NMAR models will yield large gains in collaborative filtering.

José Miguel Hernández-Lobato, Neil Houlsby, and Zoubin Ghahramani. Stochastic inference for scalable probabilistic modeling of binary matrices. In 31st International Conference on Machine Learning, Beijing, China, June 2014.

Abstract: Fully observed large binary matrices appear in a wide variety of contexts. To model them, probabilistic matrix factorization (PMF) methods are an attractive solution. However, current batch algorithms for PMF can be inefficient because they need to analyze the entire data matrix before producing any parameter updates. We derive an efficient stochastic inference algorithm for PMF models of fully observed binary matrices. Our method exhibits faster convergence rates than more expensive batch approaches and has better predictive performance than scalable alternatives. The proposed method includes new data subsampling strategies which produce large gains over standard uniform subsampling. We also address the task of automatically selecting the size of the minibatches of data used by our method. For this, we derive an algorithm that adjusts this hyper-parameter online.

Konstantina Palla, David A. Knowles, and Zoubin Ghahramani. A reversible infinite hmm using normalised random measures. In 31st International Conference on Machine Learning, Beijing, China, June 2014.

Abstract: We present a nonparametric prior over reversible Markov chains. We use completely random measures, specifically gamma processes, to construct a countably infinite graph with weighted edges. By enforcing symmetry to make the edges undirected we define a prior over random walks on graphs that results in a reversible Markov chain. The resulting prior over infinite transition matrices is closely related to the hierarchical Dirichlet process but enforces reversibility. A reinforcement scheme has recently been proposed with similar properties, but the de Finetti measure is not well characterised. We take the alternative approach of explicitly constructing the mixing measure, which allows more straightforward and efficient inference at the cost of no longer having a closed form predictive distribution. We use our process to construct a reversible infinite HMM which we apply to two real datasets, one from epigenomics and one ion channel recording.

David Duvenaud, Oren Rippel, Ryan P. Adams, and Zoubin Ghahramani. Avoiding pathologies in very deep networks. In 17th International Conference on Artificial Intelligence and Statistics, Reykjavik, Iceland, April 2014.

Abstract: Choosing appropriate architectures and regularization strategies for deep networks is crucial to good predictive performance. To shed light on this problem, we analyze the analogous problem of constructing useful priors on compositions of functions. Specifically, we study the deep Gaussian process, a type of infinitely-wide, deep neural network. We show that in standard architectures, the representational capacity of the network tends to capture fewer degrees of freedom as the number of layers increases, retaining only a single degree of freedom in the limit. We propose an alternate network architecture which does not suffer from this pathology. We also examine deep covariance functions, obtained by composing infinitely many feature transforms. Lastly, we characterize the class of models obtained by performing dropout on Gaussian processes.

Sébastien Bratières, Novi Quadrianto, Sebastian Nowozin, and Zoubin Ghahramani. Scalable Gaussian Process structured prediction for grid factor graph applications. In 31st International Conference on Machine Learning, 2014.

Abstract: Structured prediction is an important and well studied problem with many applications across machine learning. GPstruct is a recently proposed structured prediction model that offers appealing properties such as being kernelised, non-parametric, and supporting Bayesian inference (Bratières et al. 2013). The model places a Gaussian process prior over energy functions which describe relationships between input variables and structured output variables. However, the memory demand of GPstruct is quadratic in the number of latent variables and training runtime scales cubically. This prevents GPstruct from being applied to problems involving grid factor graphs, which are prevalent in computer vision and spatial statistics applications. Here we explore a scalable approach to learning GPstruct models based on ensemble learning, with weak learners (predictors) trained on subsets of the latent variables and bootstrap data, which can easily be distributed. We show experiments with 4M latent variables on image segmentation. Our method outperforms widely-used conditional random field models trained with pseudo-likelihood. Moreover, in image segmentation problems it improves over recent state-of-the-art marginal optimisation methods in terms of predictive performance and uncertainty calibration. Finally, it generalises well on all training set sizes.

Alex Davies and Zoubin Ghahramani. The random forest kernel and other kernels for big data from random partitions. arXiv, abs/1402.4293, 2014.

Abstract: We present Random Partition Kernels, a new class of kernels derived by demonstrating a natural connection between random partitions of objects and kernels between those objects. We show how the construction can be used to create kernels from methods that would not normally be viewed as random partitions, such as Random Forest. To demonstrate the potential of this method, we propose two new kernels, the Random Forest Kernel and the Fast Cluster Kernel, and show that these kernels consistently outperform standard kernels on problems involving real-world datasets. Finally, we show how the form of these kernels lend themselves to a natural approximation that is appropriate for certain big data problems, allowing O(N) inference in methods such as Gaussian Processes, Support Vector Machines and Kernel PCA.

Yarin Gal and Zoubin Ghahramani. Pitfalls in the use of parallel inference for the Dirichlet process. In Proceedings of the 31th International Conference on Machine Learning (ICML-14), 2014.

Abstract: Recent work done by Lovell, Adams, and Mansingka (2012) and Williamson, Dubey, and Xing (2013) has suggested an alternative parametrisation for the Dirichlet process in order to derive non-approximate parallel MCMC inference for it – work which has been picked-up and implemented in several different fields. In this paper we show that the approach suggested is impractical due to an extremely unbalanced distribution of the data. We characterise the requirements of efficient parallel inference for the Dirichlet process and show that the proposed inference fails most of these requirements (while approximate approaches often satisfy most of them). We present both theoretical and experimental evidence, analysing the load balance for the inference and showing that it is independent of the size of the dataset and the number of nodes available in the parallel implementation. We end with suggestions of alternative paths of research for efficient non-approximate parallel inference for the Dirichlet process.

David Lopez-Paz, Suvrit Sra, Alex J. Smola, Zoubin Ghahramani, and Bernhard Schölkopf. Randomized nonlinear component analysis. In ICML, volume 29 of JMLR Proceedings. JMLR.org, 2014.

Abstract: Classical techniques such as Principal Component Analysis (PCA) and Canonical Correlation Analysis (CCA) are ubiquitous in statistics. However, these techniques only reveal linear relationships in data. Although nonlinear variants of PCA and CCA have been proposed, they are computationally prohibitive in the large scale. In a separate strand of recent research, randomized methods have been proposed to construct features that help reveal nonlinear patterns in data. For basic tasks such as regression or classification, random features exhibit little or no loss in performance, while achieving dramatic savings in computational requirements. In this paper we leverage randomness to design scalable new variants of nonlinear PCA and CCA; our ideas also extend to key multivariate analysis tools such as spectral clustering or LDA. We demonstrate our algorithms through experiments on real-world data, on which we compare against the state-of-the-art. Code in R implementing our methods is provided in the Appendix.

Alexander G. D. G. Matthews and Zoubin Ghahramani. Classification using log Gaussian Cox processes. arXiv preprint arXiv:1405.4141, 2014.

Abstract: McCullagh and Yang (2006) suggest a family of classification algorithms based on Cox processes. We further investigate the log Gaussian variant which has a number of appealing properties. Conditioned on the covariates, the distribution over labels is given by a type of conditional Markov random field. In the supervised case, computation of the predictive probability of a single test point scales linearly with the number of training points and the multiclass generalization is straightforward. We show new links between the supervised method and classical nonparametric methods. We give a detailed analysis of the pairwise graph representable Markov random field, which we use to extend the model to semi-supervised learning problems, and propose an inference method based on graph min-cuts. We give the first experimental analysis on supervised and semi-supervised datasets and show good empirical performance.

Amar Shah, Andrew Gordon Wilson, and Zoubin Ghahramani. Student-t processes as alternatives to Gaussian processes. In AISTATS, JMLR Proceedings. JMLR.org, 2014.

Abstract: We investigate the Student-t process as an alternative to the Gaussian process as a nonparametric prior over functions. We derive closed form expressions for the marginal likelihood and predictive distribution of a Student-t process, by integrating away an inverse Wishart process prior over the covariance kernel of a Gaussian process model. We show surprising equivalences between different hierarchical Gaussian process models leading to Student-t processes, and derive a new sampling scheme for the inverse Wishart process, which helps elucidate these equivalences. Overall, we show that a Student-t process can retain the attractive properties of a Gaussian process - a nonparametric representation, analytic marginal and predictive distributions, and easy model selection through covariance kernels - but has enhanced flexibility, and predictive covariances that, unlike a Gaussian process, explicitly depend on the values of training observations. We verify empirically that a Student-t process is especially useful in situations where there are changes in covariance structure, or in applications like Bayesian optimization, where accurate predictive covariances are critical for good performance. These advantages come at no additional computational cost over Gaussian processes.

Tomoharu Iwata, David Duvenaud, and Zoubin Ghahramani. Warped mixtures for nonparametric cluster shapes. In 29th Conference on Uncertainty in Artificial Intelligence, Bellevue, Washington, July 2013.

Abstract: A mixture of Gaussians fit to a single curved or heavy-tailed cluster will report that the data contains many clusters. To produce more appropriate clusterings, we introduce a model which warps a latent mixture of Gaussians to produce nonparametric cluster shapes. The possibly low-dimensional latent mixture model allows us to summarize the properties of the high-dimensional clusters (or density manifolds) describing the data. The number of manifolds, as well as the shape and dimension of each manifold is automatically inferred. We derive a simple inference scheme for this model which analytically integrates out both the mixture parameters and the warping function. We show that our model is effective for density estimation, performs better than infinite Gaussian mixture models at recovering the true number of clusters, and produces interpretable summaries of high-dimensional datasets.

Novi Quadrianto, Viktoriia Sharmanska, David A. Knowles, and Zoubin Ghahramani. The supervised ibp: Neighbourhood preserving infinite latent feature models. In 29th Conference on Uncertainty in Artificial Intelligence, Bellevue, USA, July 2013.

Abstract: We propose a probabilistic model to infer supervised latent variables in the Hamming space from observed data. Our model allows simultaneous inference of the number of binary latent variables, and their values. The latent variables preserve neighbourhood structure of the data in a sense that objects in the same semantic concept have similar latent values, and objects in different concepts have dissimilar latent values. We formulate the supervised infinite latent variable problem based on an intuitive principle of pulling objects together if they are of the same type, and pushing them apart if they are not. We then combine this principle with a flexible Indian Buffet Process prior on the latent variables. We show that the inferred supervised latent variables can be directly used to perform a nearest neighbour search for the purpose of retrieval. We introduce a new application of dynamically extending hash codes, and show how to effectively couple the structure of the hash codes with continuously growing structure of the neighbourhood preserving infinite latent feature space.

David Duvenaud, James Robert Lloyd, Roger Grosse, Joshua B. Tenenbaum, and Zoubin Ghahramani. Structure discovery in nonparametric regression through compositional kernel search. In 30th International Conference on Machine Learning, Atlanta, Georgia, USA, June 2013.

Abstract: Despite its importance, choosing the structural form of the kernel in nonparametric regression remains a black art. We define a space of kernel structures which are built compositionally by adding and multiplying a small number of base kernels. We present a method for searching over this space of structures which mirrors the scientific discovery process. The learned structures can often decompose functions into interpretable components and enable long-range extrapolation on time-series datasets. Our structure search method outperforms many widely used kernels and kernel combination methods on a variety of prediction tasks.

Creighton Heaukulani and Zoubin Ghahramani. Dynamic probabilistic models for latent feature propagation in social networks. In 30th International Conference on Machine Learning, Atlanta, Georgia, USA, June 2013.

Abstract: Current Bayesian models for dynamic social network data have focused on modelling the influence of evolving unobserved structure on observed social interactions. However, an understanding of how observed social relationships from the past affect future unobserved structure in the network has been neglected. In this paper, we introduce a new probabilistic model for capturing this phenomenon, which we call latent feature propagation, in social networks. We demonstrate our model's capability for inferring such latent structure in varying types of social network datasets, and experimental studies show this structure achieves higher predictive performance on link prediction and forecasting tasks.

David Lopez-Paz, José Miguel Hernández-Lobato, and Zoubin Ghahramani. Gaussian process vine copulas for multivariate dependence. In 30th International Conference on Machine Learning, Atlanta, Georgia, USA, June 2013.

Abstract: Copulas allow to learn marginal distributions separately from the multivariate dependence structure (copula) that links them together into a density function. Vine factorizations ease the learning of high-dimensional copulas by constructing a hierarchy of conditional bivariate copulas. However, to simplify inference, it is common to assume that each of these conditional bivariate copulas is independent from its conditioning variables. In this paper, we relax this assumption by discovering the latent functions that specify the shape of a conditional copula given its conditioning variables We learn these functions by following a Bayesian approach based on sparse Gaussian processes with expectation propagation for scalable, approximate inference. Experiments on real-world datasets show that, when modeling all conditional dependencies, we obtain better estimates of the underlying copula of the data.

Yue Wu, José Miguel Hernández-Lobato, and Zoubin Ghahramani. Dynamic covariance models for multivariate financial time series. In 30th International Conference on Machine Learning, Atlanta, Georgia, USA, June 2013.

Abstract: The accurate prediction of time-changing covariances is an important problem in the modeling of multivariate financial data. However, some of the most popular models suffer from a) overfitting problems and multiple local optima, b) failure to capture shifts in market conditions and c) large computational costs. To address these problems we introduce a novel dynamic model for time-changing covariances. Over-fitting and local optima are avoided by following a Bayesian approach instead of computing point estimates. Changes in market conditions are captured by assuming a diffusion process in parameter values, and finally computationally efficient and scalable inference is performed using particle filters. Experiments with financial data show excellent performance of the proposed method with respect to current standard models.

Jacob Andreas and Zoubin Ghahramani. A generative model of vector space semantics. ACL 2013, page 91, 2013.

Abstract: We present a novel compositional, generative model for vector space representations of meaning. This model reformulates earlier tensor-based approaches to vector space semantics as a top-down process, and provides efficient algorithms for transformation from natural language to vectors and from vectors to natural language. We describe procedures for estimating the parameters of the model from positive examples of similar phrases, and from distributional representations, then use these procedures to obtain similarity judgments for a set of adjective-noun pairs. The model’s estimation of the similarity of these pairs correlates well with human annotations, demonstrating a substantial improvement over several existing compositional approaches in both settings.

Sébastien Bratières, Novi Quadrianto, and Zoubin Ghahramani. Bayesian structured prediction using Gaussian processes. arXiv, abs/1307.3846, 2013.

Abstract: We introduce a conceptually novel structured prediction model, GPstruct, which is kernelized, non-parametric and Bayesian, by design. We motivate the model with respect to existing approaches, among others, conditional random fields (CRFs), maximum margin Markov networks (M3N), and structured support vector machines (SVMstruct), which embody only a subset of its properties. We present an inference procedure based on Markov Chain Monte Carlo. The framework can be instantiated for a wide range of structured objects such as linear chains, trees, grids, and other general graphs. As a proof of concept, the model is benchmarked on several natural language processing tasks and a video gesture segmentation task involving a linear chain structure. We show prediction accuracies for GPstruct which are comparable to or exceeding those of CRFs and SVMstruct.

Konstantinos Bousmalis, Stefanos Zafeiriou, Louis-Philippe Morency, Maja Pantic, and Zoubin Ghahramani. Variational hidden conditional random fields with coupled Dirichlet process mixtures. In Hendrik Blockeel, Kristian Kersting, Siegfried Nijssen, and Filip Zelezný, editors, ECML/PKDD, volume 8189 of Lecture Notes in Computer Science, pages 531-547. Springer, 2013.

Abstract: Hidden Conditional Random Fields (HCRFs) are discriminative latent variable models which have been shown to successfully learn the hidden structure of a given classification problem. An infinite HCRF is an HCRF with a countably infinite number of hidden states, which rids us not only of the necessity to specify a priori a fixed number of hidden states available but also of the problem of overfitting. Markov chain Monte Carlo (MCMC) sampling algorithms are often employed for inference in such models. However, convergence of such algorithms is rather difficult to verify, and as the complexity of the task at hand increases, the computational cost of such algorithms often becomes prohibitive. These limitations can be overcome by variational techniques. In this paper, we present a generalized framework for infinite HCRF models, and a novel variational inference approach on a model based on coupled Dirichlet Process Mixtures, the HCRF-DPM. We show that the variational HCRF-DPM is able to converge to a correct number of represented hidden states, and performs as well as the best parametric HCRFs -chosen via cross-validation- for the difficult tasks of recognizing instances of agreement, disagreement, and pain in audiovisual sequences.

Frederik Eaton and Zoubin Ghahramani. Model reductions for inference: Generality of pairwise, binary, and planar factor graphs. Neural Computation, 25(5):1213-1260, 2013.

Abstract: We offer a solution to the problem of efficiently translating algorithms between different types of discrete statistical model. We investigate the expressive power of three classes of model-those with binary variables, with pairwise factors, and with planar topology-as well as their four intersections. We formalize a notion of "simple reduction" for the problem of inferring marginal probabilities and consider whether it is possible to "simply reduce" marginal inference from general discrete factor graphs to factor graphs in each of these seven subclasses. We characterize the reducibility of each class, showing in particular that the class of binary pairwise factor graphs is able to simply reduce only positive models. We also exhibit a continuous "spectral reduction" based on polynomial interpolation, which overcomes this limitation. Experiments assess the performance of standard approximate inference algorithms on the outputs of our reductions.

José Miguel Hernández-Lobato, Neil Houlsby, and Zoubin Ghahramani. Stochastic inference for scalable probabilistic modeling of binary matrices. In NIPS Workshop on Randomized Methods for Machine Learning, 2013.

Abstract: Fully observed large binary matrices appear in a wide variety of contexts. To model them, probabilistic matrix factorization (PMF) methods are an attractive solution. However, current batch algorithms for PMF can be inefficient since they need to analyze the entire data matrix before producing any parameter updates. We derive an efficient stochastic inference algorithm for PMF models of fully observed binary matrices. Our method exhibits faster convergence rates than more expensive batch approaches and has better predictive performance than scalable alternatives. The proposed method includes new data subsampling strategies which produce large gains over standard uniform subsampling. We also address the task of automatically selecting the size of the minibatches of data and we propose an algorithm that adjusts this hyper-parameter in an online manner.

Tomoharu Iwata, Neil Houlsby, and Zoubin Ghahramani. Active learning for interactive visualization. In 16th International Conference on Artificial Intelligence and Statistics, 2013.

Abstract: Many automatic visualization methods have been proposed. However, a visualization that is automatically generated might be different to how a user wants to arrange the objects in visualization space. By allowing users to re-locate objects in the embedding space of the visualization, they can adjust the visualization to their preference. We propose an active learning framework for interactive visualization which selects objects for the user to re-locate so that they can obtain their desired visualization by re-locating as few as possible. The framework is based on an information theoretic criterion, which favors objects that reduce the uncertainty of the visualization. We present a concrete application of the proposed framework to the Laplacian eigenmap visualization method. We demonstrate experimentally that the proposed framework yields the desired visualization with fewer user interactions than existing methods.

Tomoharu Iwata, Amar Shah, and Zoubin Ghahramani. Discovering latent influence in online social activities via shared cascade poisson processes. In KDD, pages 266-274. ACM, 2013.

Abstract: Many people share their activities with others through online communities. These shared activities have an impact on other users' activities. For example, users are likely to become interested in items that are adopted (e.g. liked, bought and shared) by their friends. In this paper, we propose a probabilistic model for discovering latent influence from sequences of item adoption events. An inhomogeneous Poisson process is used for modeling a sequence, in which adoption by a user triggers the subsequent adoption of the same item by other users. For modeling adoption of multiple items, we employ multiple inhomogeneous Poisson processes, which share parameters, such as influence for each user and relations between users. The proposed model can be used for finding influential users, discovering relations between users and predicting item popularity in the future. We present an efficient Bayesian inference procedure of the proposed model based on the stochastic EM algorithm. The effectiveness of the proposed model is demonstrated by using real data sets in a social bookmark sharing service.

Simon Lacoste-Julien, Konstantina Palla, Alex Davies, Gjergji Kasneci, Thore Graepel, and Zoubin Ghahramani. Sigma: simple greedy matching for aligning large knowledge bases. In KDD, pages 572-580. ACM, 2013.

Abstract: The Internet has enabled the creation of a growing number of large-scale knowledge bases in a variety of domains containing complementary information. Tools for automatically aligning these knowledge bases would make it possible to unify many sources of structured knowledge and answer complex queries. However, the efficient alignment of large-scale knowledge bases still poses a considerable challenge. Here, we present Simple Greedy Matching (SiGMa), a simple algorithm for aligning knowledge bases with millions of entities and facts. SiGMa is an iterative propagation algorithm which leverages both the structural information from the relationship graph as well as flexible similarity measures between entity properties in a greedy local search, thus making it scalable. Despite its greedy nature, our experiments indicate that SiGMa can efficiently match some of the world's largest knowledge bases with high precision. We provide additional experiments on benchmark datasets which demonstrate that SiGMa can outperform state-of-the-art approaches both in accuracy and efficiency.

Colorado Reed and Zoubin Ghahramani. Scaling the Indian buffet process via submodular maximization. In ICML, volume 28 of JMLR Proceedings, pages 1013-1021. JMLR.org, 2013.

Abstract: Inference for latent feature models is inherently difficult as the inference space grows exponentially with the size of the input data and number of latent features. In this work, we use Kurihara & Welling (2008)'s maximization-expectation framework to perform approximate MAP inference for linear-Gaussian latent feature models with an Indian Buffet Process (IBP) prior. This formulation yields a submodular function of the features that corresponds to a lower bound on the model evidence. By adding a constant to this function, we obtain a nonnegative submodular function that can be maximized via a greedy algorithm that obtains at least a one-third approximation to the optimal solution. Our inference method scales linearly with the size of the input data, and we show the efficacy of our method on the largest datasets currently analyzed using an IBP model.

Amar Shah and Zoubin Ghahramani. Determinantal clustering processes - A nonparametric Bayesian approach to kernel based semi-supervised clustering. UAI, 2013.

Abstract: Semi-supervised clustering is the task of clustering data points into clusters where only a fraction of the points are labelled. The true number of clusters in the data is often unknown and most models require this parameter as an input. Dirichlet process mixture models are appealing as they can infer the number of clusters from the data. However, these models do not deal with high dimensional data well and can encounter difficulties in inference. We present a novel nonparameteric Bayesian kernel based method to cluster data points without the need to prespecify the number of clusters or to model complicated densities from which data points are assumed to be generated from. The key insight is to use determinants of submatrices of a kernel matrix as a measure of how close together a set of points are. We explore some theoretical properties of the model and derive a natural Gibbs based algorithm with MCMC hyperparameter learning. The model is implemented on a variety of synthetic and real world data sets.

James Robert Lloyd, Peter Orbanz, Zoubin Ghahramani, and Daniel M. Roy. Random function priors for exchangeable arrays with applications to graphs and relational data. In Advances in Neural Information Processing Systems 26, pages 1-9, Lake Tahoe, California, USA, December 2012.

Abstract: A fundamental problem in the analysis of structured relational data like graphs, networks, databases, and matrices is to extract a summary of the common structure underlying relations between individual entities. Relational data are typically encoded in the form of arrays; invariance to the ordering of rows and columns corresponds to exchangeable arrays. Results in probability theory due to Aldous, Hoover and Kallenberg show that exchangeable arrays can be represented in terms of a random measurable function which constitutes the natural model parameter in a Bayesian model. We obtain a flexible yet simple Bayesian nonparametric model by placing a Gaussian process prior on the parameter function. Efficient inference utilises elliptical slice sampling combined with a random sparse approximation to the Gaussian process. We demonstrate applications of the model to network data and clarify its relation to models in the literature, several of which emerge as special cases.

Michael A. Osborne, David Duvenaud, Roman Garnett, Carl Edward Rasmussen, Stephen J. Roberts, and Zoubin Ghahramani. Active learning of model evidence using Bayesian quadrature. In Advances in Neural Information Processing Systems 25, pages 46-54, Lake Tahoe, California, USA, December 2012.

Abstract: Numerical integration is a key component of many problems in scientific computing, statistical modelling, and machine learning. Bayesian Quadrature is a model-based method for numerical integration which, relative to standard Monte Carlo methods, offers increased sample efficiency and a more robust estimate of the uncertainty in the estimated integral. We propose a novel Bayesian Quadrature approach for numerical integration when the integrand is non-negative, such as the case of computing the marginal likelihood, predictive distribution, or normalising constant of a probabilistic model. Our approach approximately marginalises the quadrature model’s hyperparameters in closed form, and introduces an active learning scheme to optimally select function evaluations, as opposed to using Monte Carlo samples. We demonstrate our method on both a number of synthetic benchmarks and a real scientific problem from astronomy.

Konstantina Palla, David A. Knowles, and Zoubin Ghahramani. A nonparametric variable clustering model. In Advances in Neural Information Processing Systems 26, pages 1-9, Lake Tahoe, California, USA, December 2012.

Abstract: Factor analysis models effectively summarise the covariance structure of high dimensional data, but the solutions are typically hard to interpret. This motivates attempting to find a disjoint partition, i.e. a simple clustering, of observed variables into highly correlated subsets. We introduce a Bayesian non-parametric approach to this problem, and demonstrate advantages over heuristic methods proposed to date. Our Dirichlet process variable clustering (DPVC) model can discover block-diagonal covariance structures in data. We evaluate our method on both synthetic and gene expression analysis problems.

Konstantina Palla, David A. Knowles, and Zoubin Ghahramani. An infinite latent attribute model for network data. In 29th International Conference on Machine Learning, Edinburgh, Scotland, June 2012.

Abstract: Latent variable models for network data extract a summary of the relational structure underlying an observed network. The simplest possible models subdivide nodes of the network into clusters; the probability of a link between any two nodes then depends only on their cluster assignment. Currently available models can be classified by whether clusters are disjoint or are allowed to overlap. These models can explain a "flat" clustering structure. Hierarchical Bayesian models provide a natural approach to capture more complex dependencies. We propose a model in which objects are characterised by a latent feature vector. Each feature is itself partitioned into disjoint groups (subclusters), corresponding to a second layer of hierarchy. In experimental comparisons, the model achieves significantly improved predictive performance on social and biological link prediction tasks. The results indicate that models with a single layer hierarchy over-simplify real networks.

Andrew Gordon Wilson, David A. Knowles, and Zoubin Ghahramani. Gaussian process regression networks. In 29th International Conference on Machine Learning, Edinburgh, Scotland, June 2012.

Abstract: We introduce a new regression framework, Gaussian process regression networks (GPRN), which combines the structural properties of Bayesian neural networks with the nonparametric flexibility of Gaussian processes. GPRN accommodates input (predictor) dependent signal and noise correlations between multiple output (response) variables, input dependent length-scales and amplitudes, and heavy-tailed predictive distributions. We derive both elliptical slice sampling and variational Bayes inference procedures for GPRN. We apply GPRN as a multiple output regression and multivariate volatility model, demonstrating substantially improved performance over eight popular multiple output (multi-task) Gaussian process models and three multivariate volatility models on real datasets, including a 1000 dimensional gene expression dataset.

A. Bahramisharif, M. A. J. van Gerven, J-M. Schoffelen, Z. Ghahramani, and T. Heskes. The dynamic beamformer. In G. Langs et al, editor, Machine Learning in Interpretation of Neuroimaging (MLINI) 2011 LNAI 7263, pages 148-155, 2012.

Abstract: Beamforming is one of the most commonly used methods for estimating the active neural sources from the MEG or EEG sensor readings. The basic assumption in beamforming is that the sources are uncorrelated, which allows for estimating each source independent of the others. In this paper, we incorporate the independence assumption of the standard beamformer in a linear dynamical system, thereby introducing the dynamic beamformer. Using empirical data, we show that the dynamic beamformer outperforms the standard beamformer in predicting the condition of interest which strongly suggests that it also outperforms the standard method in localizing the active neural generators.

John P. Cunningham, Zoubin Ghahramani, and Carl E. Rasmussen. Gaussian Processes for time-marked time-series data. In 15th International Conference on Artificial Intelligence and Statistics, 2012.

Abstract: In many settings, data is collected as multiple time series, where each recorded time series is an observation of some underlying dynamical process of interest. These observations are often time-marked with known event times, and one desires to do a range of standard analyses. When there is only one time marker, one simply aligns the observations temporally on that marker. When multiple time-markers are present and are at different times on different time series observations, these analyses are more difficult. We describe a Gaussian Process model for analyzing multiple time series with multiple time markings, and we test it on a variety of data.

Zoubin Ghahramani. Bayesian nonparametrics and the probabilistic approach to modelling. Philosophical Transactions of the Royal Society A, 2012.

Abstract: Modelling is fundamental to many fields of science and engineering. A model can be thought of as a representation of possible data one could predict from a system. The probabilistic approach to modelling uses probability theory to express all aspects of uncertainty in the model. The probabilistic approach is synonymous with Bayesian modelling, which simply uses the rules of probability theory in order to make predictions, compare alternative models, and learn model parameters and structure from data. This simple and elegant framework is most powerful when coupled with flexible probabilistic models. Flexibility is achieved through the use of Bayesian nonparametrics. This article provides an overview of probabilistic modelling and an accessible survey of some of the main tools in Bayesian nonparametrics. The survey covers the use of Bayesian nonparametrics for modelling unknown functions, density estimation, clustering, time series modelling, and representing sparsity, hierarchies, and covariance structure. More specifically it gives brief non-technical overviews of Gaussian processes, Dirichlet processes, infinite hidden Markov models, Indian buffet processes, Kingman's coalescent, Dirichlet diffusion tress, and Wishart processes.

Neil Houlsby, Jose Miguel Hernández-Lobato, Ferenc Huszár, and Zoubin Ghahramani. Collaborative Gaussian processes for preference learning. In Advances in Neural Information Processing Systems 26, pages 2096-2104. Curran Associates, Inc., 2012.

Abstract: We present a new model based on Gaussian processes (GPs) for learning pairwise preferences expressed by multiple users. Inference is simplified by using a preference kernel for GPs which allows us to combine supervised GP learning of user preferences with unsupervised dimensionality reduction for multi-user systems. The model not only exploits collaborative information from the shared structure in user behavior, but may also incorporate user features if they are available. Approximate inference is implemented using a combination of expectation propagation and variational Bayes. Finally, we present an efficient active learning strategy for querying preferences. The proposed technique performs favorably on real-world data against state-of-the-art multi-user preference learning algorithms.

Hyun-Chul Kim and Zoubin Ghahramani. Bayesian classifier combination. In 15th International Conference on Artificial Intelligence and Statistics, 2012.

Abstract: Bayesian model averaging linearly mixes the probabilistic predictions of multiple models, each weighted by its posterior probability. This is the coherent Bayesian way of combining multiple models only under certain restrictive assumptions, which we outline. We explore a general framework for Bayesian model combination (which differs from model averaging) in the context of classification. This framework explicitly models the relationship between each model’s output and the unknown true label. The framework does not require that the models be probabilistic (they can even be human assessors), that they share prior information or receive the same training data, or that they be independent in their errors. Finally, the Bayesian combiner does not need to believe any of the models is in fact correct. We test several variants of this classifier combination procedure starting from a classic statistical model proposed by Dawid and Skene (1979) and using MCMC to add more complex but important features to the model. Comparisons on sev- eral data sets to simpler methods like majority voting show that the Bayesian methods not only perform well but result in interpretable diagnostics on the data points and the models.

P. Kirk, J. E. Griffin, R. S. Savage, Z. Ghahramani, and D. L. Wild. Bayesian correlated clustering to integrate multiple datasets. Bioinformatics, 2012.

Abstract: Motivation: The integration of multiple datasets remains a key challenge in systems biology and genomic medicine. Modern high-throughput technologies generate a broad array of different data types, providing distinct – but often complementary – information. We present a Bayesian method for the unsupervised integrative modelling of multiple datasets, which we refer to as MDI (Multiple Dataset Integration). MDI can integrate information from a wide range of different datasets and data types simultaneously (including the ability to model time series data explicitly using Gaussian processes). Each dataset is modelled using a Dirichlet-multinomial allocation (DMA) mixture model, with dependencies between these models captured via parameters that describe the agreement among the datasets.
Results: Using a set of 6 artificially constructed time series datasets, we show that MDI is able to integrate a significant number of datasets simultaneously, and that it successfully captures the underlying structural similarity between the datasets. We also analyse a variety of real S. cerevisiae datasets. In the 2-dataset case, we show that MDI’s performance is comparable to the present state of the art. We then move beyond the capabilities of current approaches and integrate gene expression, ChIP-chip and protein-protein interaction data, to identify a set of protein complexes for which genes are co-regulated during the cell cycle. Comparisons to other unsupervised data integration techniques – as well as to non-integrative approaches – demonstrate that MDI is very competitive, while also providing information that would be difficult or impossible to extract using other methods.

Comment: This paper is available from the Bioinformatics site and a Matlab implementation of MDI is available fromthis site.

Paul D. W. Kirk, Jim E. Griffin, Richard S. Savage, Zoubin Ghahramani, and David L. Wild. Bayesian correlated clustering to integrate multiple datasets. Bioinformatics, 28(24):3290-3297, 2012.

Abstract: MOTIVATION: The integration of multiple datasets remains a key challenge in systems biology and genomic medicine. Modern high-throughput technologies generate a broad array of different data types, providing distinct-but often complementary-information. We present a Bayesian method for the unsupervised integrative modelling of multiple datasets, which we refer to as MDI (Multiple Dataset Integration). MDI can integrate information from a wide range of different datasets and data types simultaneously (including the ability to model time series data explicitly using Gaussian processes). Each dataset is modelled using a Dirichlet-multinomial allocation (DMA) mixture model, with dependencies between these models captured through parameters that describe the agreement among the datasets. RESULTS: Using a set of six artificially constructed time series datasets, we show that MDI is able to integrate a significant number of datasets simultaneously, and that it successfully captures the underlying structural similarity between the datasets. We also analyse a variety of real Saccharomyces cerevisiae datasets. In the two-dataset case, we show that MDI's performance is comparable with the present state-of-the-art. We then move beyond the capabilities of current approaches and integrate gene expression, chromatin immunoprecipitation-chip and protein-protein interaction data, to identify a set of protein complexes for which genes are co-regulated during the cell cycle. Comparisons to other unsupervised data integration techniques-as well as to non-integrative approaches-demonstrate that MDI is competitive, while also providing information that would be difficult or impossible to extract using other methods.

Shakir Mohamed, Katherine A. Heller, and Zoubin Ghahramani. Bayesian and L1 approaches for sparse unsupervised learning. In 29th International Conference on Machine Learning, 2012.

Abstract: The use of L1 regularisation for sparse learning has generated immense research interest, with many successful applications in diverse areas such as signal acquisition, image coding, genomics and collaborative filtering. While existing work highlights the many advantages of L1 methods, in this paper we find that L1 regularisation often dramatically under-performs in terms of predictive performance when compared to other methods for inferring sparsity. We focus on unsupervised latent variable models, and develop L1 minimising factor models, Bayesian variants of "L1", and Bayesian models with a stronger L0-like sparsity induced through spike-and-slab distributions. These spike-and-slab Bayesian factor models encourage sparsity while accounting for uncertainty in a principled manner, and avoid unnecessary shrinkage of non-zero values. We demonstrate on a number of data sets that in practice spike-and-slab Bayesian methods out-perform L1 minimisation, even on a com- putational budget. We thus highlight the need to re-assess the wide use of L1 methods in sparsity-reliant applications, particularly when we care about generalising to previously unseen data, and provide an alternative that, over many varying conditions, provides improved generalisation performance.

Donglin Niu, Jennifer G. Dy, and Z. Ghahramani. A nonparametric Bayesian model for multiple clustering with overlapping feature views. In 15th International Conference on Artificial Intelligence and Statistics, 2012.

Abstract: Most clustering algorithms produce a single clustering solution. This is inadequate for many data sets that are multi-faceted and can be grouped and interpreted in many different ways. Moreover, for high-dimensional data, different features may be relevant or irrelevant to each clustering solution, suggesting the need for feature selection in clustering. Features relevant to one clustering interpretation may be different from the ones relevant for an alternative interpretation or view of the data. In this paper, we introduce a probabilistic nonparametric Bayesian model that can discover multiple clustering solutions from data and the feature subsets that are relevant for the clusters in each view. In our model, the features in different views may be shared and therefore the sets of relevant features are allowed to overlap. We model feature relevance to each view using an Indian Buffet Process and the cluster membership in each view using a Chinese Restaurant Process. We provide an inference approach to learn the latent parameters corresponding to this multiple partitioning problem. Our model not only learns the features and clusters in each view but also automatically learns the number of clusters, number of views and number of features in each view.

Barnabas Poczos, Zoubin Ghahramani, and Jeff Schneider. Copula-based kernel dependency measures. In 29th International Conference on Machine Learning, 2012.

Abstract: The paper presents a new copula based method for measuring dependence between random variables. Our approach extends the Maximum Mean Discrepancy to the copula of the joint distribution. We prove that this approach has several advantageous properties. Similarly to Shannon mutual information, the proposed dependence measure is invariant to any strictly increasing transformation of the marginal variables. This is important in many applications, for example in feature selection. The estimator is consistent, robust to outliers, and uses rank statistics only. We derive upper bounds on the convergence rate and propose independence tests too. We illustrate the theoretical contributions through a series of experiments in feature selection and low-dimensional embedding of distributions.

Jacob Steinhardt and Zoubin Ghahramani. Flexible martingale priors for deep hierarchies. In 15th International Conference on Artificial Intelligence and Statistics, 2012.

Abstract: When building priors over trees for Bayesian hierarchical models, there is a tension between maintaining desirable theoretical properties such as infinite exchangeability and important practical properties such as the ability to increase the depth of the tree to accommodate new data. We resolve this tension by presenting a family of infinitely exchangeable priors over discrete tree structures that allows the depth of the tree to grow with the data, and then showing that our family contains all hierarchical models with certain mild symmetry properties. We also show that deep hierarchical models are in general intimately tied to a process called a martingale, and use Doob’s martingale convergence theorem to demonstrate some unexpected properties of deep hierarchies.

Kyung-Ah Sohn, Zoubin Ghahramani, and Eric P. Xing. Robust estimation of local genetic ancestry in admixed populations using a non-parametric Bayesian approach. Genetics, 191(4), 2012.

Abstract: We present a new haplotype-based approach for inferring local genetic ancestry of individuals in an admixed population. Most existing approaches for local ancestry estimation ignore the latent genetic relatedness between ancestral populations and treat them as independent. In this paper, we exploit such information by building an inheritance model that describes both the ancestral populations and the admixed population jointly in a unified framework. Based on an assumption that the common hypothetical founder haplotypes give rise to both the ancestral and admixed population haplotypes, we employ an infinite hidden Markov model to characterize each ancestral population and further extend it to generate the admixed population. Through an effective utilization of the population structural information under a principled nonparametric Bayesian framework, the resulting model is significantly less sensitive to the choice and the amount of training data for ancestral populations than state-of-the-arts algorithms. We also improve the robustness under deviation from common modeling assumptions by incorporating population-specific scale parameters that allow variable recombination rates in different populations. Our method is applicable to an admixed population from an arbitrary number of ancestral populations and also performs competitively in terms of spurious ancestry proportions under general multi-way admixture assumption. We validate the proposed method by simulation under various admixing scenarios and present empirical analysis results on worldwide distributed dataset from Human Genome Diversity Project.

Comment: doi: 10.1534/genetics.112.140228

Andrew Gordon Wilson and Zoubin Ghahramani. Modelling input varying correlations between multiple responses. In Peter A. Flach, Tijl De Bie, and Nello Cristianini, editors, ECML/PKDD, volume 7524 of Lecture Notes in Computer Science, pages 858-861. Springer, 2012.

Abstract: We introduced a generalised Wishart process (GWP) for modelling input dependent covariance matrices Σ(x), allowing one to model input varying correlations and uncertainties between multiple response variables. The GWP can naturally scale to thousands of response variables, as opposed to competing multivariate volatility models which are typically intractable for greater than 5 response variables. The GWP can also naturally capture a rich class of covariance dynamics - periodicity, Brownian motion, smoothness, - through a covariance kernel.

Yichuan Zhang, Charles A. Sutton, Amos J. Storkey, and Zoubin Ghahramani. Continuous relaxations for discrete Hamiltonian Monte Carlo. In Peter L. Bartlett, Fernando C. N. Pereira, Christopher J. C. Burges, Léon Bottou, and Kilian Q. Weinberger, editors, NIPS, pages 3203-3211, 2012.

Abstract: Continuous relaxations play an important role in discrete optimization, but have not seen much use in approximate probabilistic inference. Here we show that a general form of the Gaussian Integral Trick makes it possible to transform a wide class of discrete variable undirected models into fully continuous systems. The continuous representation allows the use of gradient-based Hamiltonian Monte Carlo for inference, results in new ways of estimating normalization constants (partition functions), and in general opens up a number of new avenues for inference in difficult discrete systems. We demonstrate some of these continuous relaxation inference algorithms on a number of illustrative problems.

Andrew Gordon Wilson, David A Knowles, and Zoubin Ghahramani. Gaussian process regression networks. Technical Report arXiv:1110.4411 [stat.ML], Department of Engineering, University of Cambridge, Cambridge, UK, October 19 2011.

Abstract: We introduce a new regression framework, Gaussian process regression networks (GPRN), which combines the structural properties of Bayesian neural networks with the non-parametric flexibility of Gaussian processes. This model accommodates input dependent signal and noise correlations between multiple response variables, input dependent length-scales and amplitudes, and heavy-tailed predictive distributions. We derive both efficient Markov chain Monte Carlo and variational Bayes inference procedures for this model. We apply GPRN as a multiple output regression and multivariate volatility model, demonstrating substantially improved performance over eight popular multiple output (multi-task) Gaussian process models and three multivariate volatility models on benchmark datasets, including a 1000 dimensional gene expression dataset.

Comment: arXiv:1110.4411

A. Davies and Z. Ghahramani. Language-independent Bayesian sentiment mining of twitter. In In The Fifth Workshop on Social Network Mining and Analysis (SNA-KDD 2011), August 2011.

Abstract: This paper outlines a new language-independent model for sentiment analysis of short, social-network statuses. We demonstrate this on data from Twitter, modelling happy vs sad sentiment, and show that in some circumstances this outperforms similar Naive Bayes models by more than 10%. We also propose an extension to allow the modelling of differ- ent sentiment distributions in different geographic regions, while incorporating information from neighbouring regions. We outline the considerations when creating a system analysing Twitter data and present a scalable system of data acquisi- tion and prediction that can monitor the sentiment of tweets in real time.

Thomas L. Griffiths and Zoubin Ghahramani. The Indian buffet process: An introduction and review. Journal of Machine Learning Research, 12:1185-1224, April 2011.

Abstract: The Indian buffet process is a stochastic process defining a probability distribution over equivalence classes of sparse binary matrices with a finite number of rows and an unbounded number of columns. This distribution is suitable for use as a prior in probabilistic models that represent objects using a potentially infinite array of features, or that involve bipartite graphs in which the size of at least one class of nodes is unknown. We give a detailed derivation of this distribution, and illustrate its use as a prior in an infinite latent feature model. We then review recent applications of the Indian buffet process in machine learning, discuss its extensions, and summarize its connections to other stochastic processes.

Simon Lacoste-Julien, Ferenc Huszár, and Zoubin Ghahramani. Approximate inference for the loss-calibrated Bayesian. In Geoff Gordon and David Dunson, editors, 14th International Conference on Artificial Intelligence and Statistics, volume 15, pages 416-424, Fort Lauderdale, FL, USA, April 2011. Journal of Machine Learning Research.

Abstract: We consider the problem of approximate inference in the context of Bayesian decision theory. Traditional approaches focus on approximating general properties of the posterior, ignoring the decision task - and associated losses - for which the posterior could be used. We argue that this can be suboptimal and propose instead to loss-calibrate the approximate inference methods with respect to the decision task at hand. We present a general framework rooted in Bayesian decision theory to analyze approximate inference from the perspective of losses, opening up several research directions. As a first loss-calibrated approximate inference attempt, we propose an EM-like algorithm on the Bayesian posterior risk and show how it can improve a standard approach to Gaussian process classification when losses are asymmetric.

Joshua Abbott, Katherine A. Heller, Zoubin Ghahramani, and Thomas L. Griffiths. Testing a Bayesian measure of representativeness using a large image database. In Advances in Neural Information Processing Systems 24, Cambridge, MA, USA, 2011. The MIT Press.

Abstract: How do people determine which elements of a set are most representative of that set? We extend an existing Bayesian measure of representativeness, which indicates the representativeness of a sample from a distribution, to define a measure of the representativeness of an item to a set. We show that this measure is formally related to a machine learning method known as Bayesian Sets. Building on this connection, we derive an analytic expression for the representativeness of objects described by a sparse vector of binary features. We then apply this measure to a large database of images, using it to determine which images are the most representative members of different sets. Comparing the resulting predictions to human judgments of representativeness provides a test of this measure with naturalistic stimuli, and illustrates how databases that are more commonly used in computer vision and machine learning can be used to evaluate psychological theories.

Finale Doshi-Velez and Zoubin Ghahramani. A comparison of human and agent reinforcement learning in partially observable domains. In 33rd Annual Meeting of the Cognitive Science Society, Boston, MA, 2011.

Abstract: It is commonly stated that reinforcement learning (RL) algorithms learn slower than humans. In this work, we investigate this claim using two standard problems from the RL literature. We compare the performance of human subjects to RL techniques. We find that context-the meaningfulness of the observations—-plays a significant role in the rate of human RL. Moreover, without contextual information, humans often fare much worse than classic algorithms. Comparing the detailed responses of humans and RL algorithms, we also find that humans appear to employ rather different strategies from standard algorithms, even in cases where they had indistinguishable performance to them. Our research both sheds light on human RL and provides insights for improving RL algorithms.

Neil Houlsby, Ferenc Huszar, Zoubin Ghahramani, and Máté Lengyel. Bayesian active learning for classification and preference learning. arXiv, abs/1112.5745, 2011.

Abstract: Information theoretic active learning has been widely studied for probabilistic models. For simple regression an optimal myopic policy is easily tractable. However, for other tasks and with more complex models, such as classification with nonparametric models, the optimal solution is harder to compute. Current approaches make approximations to achieve tractability. We propose an approach that expresses information gain in terms of predictive entropies, and apply this method to the Gaussian Process Classifier (GPC). Our approach makes minimal approximations to the full information theoretic objective. Our experimental performance compares favourably to many popular active learning algorithms, and has equal or lower computational complexity. We compare well to decision theoretic approaches also, which are privy to more information and require much more computational time. Secondly, by developing further a reformulation of binary preference learning to a classification problem, we extend our algorithm to Gaussian Process preference learning.

David A. Knowles and Zoubin Ghahramani. Nonparametric Bayesian sparse factor models with application to gene expression modelling.. Annals of Applied Statistics, 5(2B):1534-1552, 2011.

Abstract: A nonparametric Bayesian extension of Factor Analysis (FA) is proposed where observed data Y is modeled as a linear superposition, G, of a potentially infinite number of hidden factors, X. The Indian Buffet Process (IBP) is used as a prior on G to incorporate sparsity and to allow the number of latent features to be inferred. The model's utility for modeling gene expression data is investigated using randomly generated data sets based on a known sparse connectivity matrix for E. Coli, and on three biological data sets of increasing complexity.

David A. Knowles and Zoubin Ghahramani. Pitman-Yor diffusion trees. In 27th Conference on Uncertainty in Artificial Intelligence, 2011.

Abstract: We introduce the Pitman Yor Diffusion Tree (PYDT) for hierarchical clustering, a generalization of the Dirichlet Diffusion Tree (Neal, 2001) which removes the restriction to binary branching structure. The generative process is described and shown to result in an exchangeable distribution over data points. We prove some theoretical properties of the model and then present two inference methods: a collapsed MCMC sampler which allows us to model uncertainty over tree structures, and a computationally efficient greedy Bayesian EM search algorithm. Both algorithms use message passing on the tree structure. The utility of the model and algorithms is demonstrated on synthetic and real world data, both continuous and binary.

Comment: web site

David A. Knowles, Jurgen Van Gael, and Zoubin Ghahramani. Message passing algorithms for the Dirichlet diffusion tree. In 28th International Conference on Machine Learning, 2011.

Abstract: We demonstrate efficient approximate inference for the Dirichlet Diffusion Tree (Neal, 2003), a Bayesian nonparametric prior over tree structures. Although DDTs provide a powerful and elegant approach for modeling hierarchies they haven't seen much use to date. One problem is the computational cost of MCMC inference. We provide the first deterministic approximate inference methods for DDT models and show excellent performance compared to the MCMC alternative. We present message passing algorithms to approximate the Bayesian model evidence for a specific tree. This is used to drive sequential tree building and greedy search to find optimal tree structures, corresponding to hierarchical clusterings of the data. We demonstrate appropriate observation models for continuous and binary data. The empirical performance of our method is very close to the computationally expensive MCMC alternative on a density estimation problem, and significantly outperforms kernel density estimators.

Comment: web site

Andrew Gordon Wilson and Zoubin Ghahramani. Generalised Wishart processes. In 27th Conference on Uncertainty in Artificial Intelligence, 2011.

Abstract: We introduce a new stochastic process called the generalised Wishart process (GWP). It is a collection of positive semi-definite random matrices indexed by any arbitrary input variable. We use this process as a prior over dynamic (e.g. time varying) covariance matrices. The GWP captures a diverse class of covariance dynamics, naturally hanles missing data, scales nicely with dimension, has easily interpretable parameters, and can use input variables that include covariates other than time. We describe how to construct the GWP, introduce general procedures for inference and prediction, and show that it outperforms its main competitor, multivariate GARCH, even on financial data that especially suits GARCH.

Comment: Supplementary Material, Best Student Paper Award

Ryan Turner, Steven Bottone, and Zoubin Ghahramani. Fast online anomaly detection using scan statistics. In Samuel Kaski, David J. Miller, Erkki Oja, and Antti Honkela, editors, Machine Learning for Signal Processing (MLSP 2010), pages 385-390, Kittilä, Finland, August 2010.

Abstract: We present methods to do fast online anomaly detection using scan statistics. Scan statistics have long been used to detect statistically significant bursts of events. We extend the scan statistics framework to handle many practical issues that occur in application: dealing with an unknown background rate of events, allowing for slow natural changes in background frequency, the inverse problem of finding an unusual lack of events, and setting the test parameters to maximize power. We demonstrate its use on real and synthetic data sets with comparison to other methods.

Y. Guan, J. G. Dy, D. Niu, and Z. Ghahramani. Variational inference for nonparametric multiple clustering. In KDD10 Workshop on Discovering, Summarizing, and Using Multiple Clusterings, Washington, DC, USA, July 2010.

Abstract: Most clustering algorithms produce a single clustering solution. Similarly, feature selection for clustering tries to find one feature subset where one interesting clustering solution resides. However, a single data set may be multi-faceted and can be grouped and interpreted in many different ways, especially for high dimensional data, where feature selection is typically needed. Moreover, different clustering solutions are interesting for different purposes. Instead of committing to one clustering solution, in this paper we introduce a probabilistic nonparametric Bayesian model that can discover several possible clustering solutions and the feature subset views that generated each cluster partitioning simultaneously. We provide a variational inference approach to learn the features and clustering partitions in each view. Our model allows us not only to learn the multiple clusterings and views but also allows us to automatically learn the number of views and the number of clusters in each view.

C. Rotsos, J. Van Gael, A.W. Moore, and Z. Ghahramani. Traffic classification in information poor environments. In 1st International Workshop on Traffic Analysis and Classification (IWCMC '10), Caen, France, July 2010.

Abstract: Traffic classification using machine learning continues to be an active research area. The majority of work in this area uses off-the-shelf machine learning tools and treats them as black-box classifiers. This approach turns all the modelling complexity into a feature selection problem. In this paper, we build a problem-specific solution to the traffic classification problem by designing a custom probabilistic graphical model. Graphical models are a modular framework to design classifiers which incorporate domain-specific knowledge. More specifically, our solution introduces semi-supervised learning which means we learn from both labelled and unlabelled traffic flows. We show that our solution performs competitively compared to previous approaches while using less data and simpler features.

R. P. Adams, H. Wallach, and Zoubin Ghahramani. Learning the structure of deep sparse graphical models. In Yee Whye Teh and Mike Titterington, editors, 13th International Conference on Artificial Intelligence and Statistics, pages 1-8, Chia Laguna, Sardinia, Italy, May 2010.

Abstract: Deep belief networks are a powerful way to model complex probability distributions. However, it is difficult to learn the structure of a belief network, particularly one with hidden units. The Indian buffet process has been used as a nonparametric Bayesian prior on the structure of a directed belief network with a single infinitely wide hidden layer. Here, we introduce the cascading Indian buffet process (CIBP), which provides a prior on the structure of a layered, directed belief network that is unbounded in both depth and width, yet allows tractable inference. We use the CIBP prior with the nonlinear Gaussian belief network framework to allow each unit to vary its behavior between discrete and continuous representations. We use Markov chain Monte Carlo for inference in this model and explore the structures learned on image data.

Comment: Winner of the Best Paper Award

Sinead Williamson, Peter Orbanz, and Zoubin Ghahramani. Dependent Indian buffet processes. In 13th International Conference on Artificial Intelligence and Statistics, volume 9 of W & CP, pages 924-931, Chia Laguna, Sardinia, Italy, May 2010.

Abstract: Latent variable models represent hidden structure in observational data. To account for the distribution of the observational data changing over time, space or some other covariate, we need generalizations of latent variable models that explicitly capture this dependency on the covariate. A variety of such generalizations has been proposed for latent variable models based on the Dirichlet process. We address dependency on covariates in binary latent feature models, by introducing a dependent Indian Buffet Process. The model generates a binary random matrix with an unbounded number of columns for each value of the covariate. Evolution of the binary matrices over the covariate set is controlled by a hierarchical Gaussian process model. The choice of covariance functions controls the dependence structure and exchangeability properties of the model. We derive a Markov Chain Monte Carlo sampling algorithm for Bayesian inference, and provide experiments on both synthetic and real-world data. The experimental results show that explicit modeling of dependencies significantly improves accuracy of predictions.

R. P. Adams, Zoubin Ghahramani, and Michael I. Jordan. Tree-structured stick breaking for hierarchical data. In Advances in Neural Information Processing Systems 23. The MIT Press, 2010.

Abstract: Many data are naturally modeled by an unobserved hierarchical structure. In this paper we propose a flexible nonparametric prior over unknown data hierarchies. The approach uses nested stick-breaking processes to allow for trees of unbounded width and depth, where data can live at any node and are infinitely exchangeable. One can view our model as providing infinite mixtures where the components have a dependency structure corresponding to an evolutionary diffusion down a tree. By using a stick-breaking approach, we can apply Markov chain Monte Carlo methods based on slice sampling to perform Bayesian inference and simulate from the posterior distribution on trees. We apply our method to hierarchical clustering of images and topic modeling of text data.

Sébastien Bratières, Jurgen Van Gael, Andreas Vlachos, and Zoubin Ghahramani. Scaling the iHMM: Parallelization versus Hadoop. In Proceedings of the 2010 10th IEEE International Conference on Computer and Information Technology, pages 1235-1240, Bradford, UK, 2010. IEEE Computer Society, doi 10.1109/CIT.2010.223.

Abstract: This paper compares parallel and distributed implementations of an iterative, Gibbs sampling, machine learning algorithm. Distributed implementations run under Hadoop on facility computing clouds. The probabilistic model under study is the infinite HMM Beal, Ghahramani and Rasmussen, 2002, in which parameters are learnt using an instance blocked Gibbs sampling, with a step consisting of a dynamic program. We apply this model to learn part-of-speech tags from newswire text in an unsupervised fashion. However our focus here is on runtime performance, as opposed to NLP-relevant scores, embodied by iteration duration, ease of development, deployment and debugging.

J. Leskovec, D. Chakrabarti, J. Kleinberg, C. Faloutsos, and Z. Ghahramani. Kronecker graphs: An approach to modeling networks. Journal of Machine Learning Research, 11(Feb):985-1042, 2010.

Abstract: How can we generate realistic networks? In addition, how can we do so with a mathematically tractable model that allows for rigorous analysis of network properties? Real networks exhibit a long list of surprising properties: Heavy tails for the in- and out-degree distribution, heavy tails for the eigenvalues and eigenvectors, small diameters, and densification and shrinking diameters over time. Current network models and generators either fail to match several of the above properties, are complicated to analyze mathematically, or both. Here we propose a generative model for networks that is both mathematically tractable and can generate networks that have all the above mentioned structural properties. Our main idea here is to use a non-standard matrix operation, the Kronecker product, to generate graphs which we refer to as "Kronecker graphs".
First, we show that Kronecker graphs naturally obey common network properties. In fact, we rigorously prove that they do so. We also provide empirical evidence showing that Kronecker graphs can effectively model the structure of real networks.
We then present KRONFIT, a fast and scalable algorithm for fitting the Kronecker graph generation model to large real networks. A naive approach to fitting would take super-exponential time. In contrast, KRONFIT takes linear time, by exploiting the structure of Kronecker matrix multiplication and by using statistical simulation techniques. Experiments on a wide range of large real and synthetic networks show that KRONFIT finds accurate parameters that very well mimic the properties of target networks. In fact, using just four parameters we can accurately model several aspects of global network structure. Once fitted, the model parameters can be used to gain insights about the network structure, and the resulting synthetic graphs can be used for null-models, anonymization, extrapolations, and graph summarization.

C. Lippert, Z. Ghahramani, and K. Borgwardt. Gene function prediction from synthetic lethality networks via ranking on demand. Bioinformatics, 26:912-918, 2010.

Abstract: Motivation: Synthetic lethal interactions represent pairs of genes whose individual mutations are not lethal, while the double mutation of both genes does incur lethality. Several studies have shown a correlation between functional similarity of genes and their distances in networks based on synthetic lethal interactions. However, there is a lack of algorithms for predicting gene function from synthetic lethality interaction networks.
Results: In this article, we present a novel technique called kernelROD for gene function prediction from synthetic lethal interaction networks based on kernel machines. We apply our novel algorithm to Gene Ontology functional annotation prediction in yeast. Our experiments show that our method leads to improved gene function prediction compared with state-of-the-art competitors and that combining genetic and congruence networks leads to a further improvement in prediction accuracy.

Charalampos Rotsos, Jurgen Van Gael, Andrew W. Moore, and Zoubin Ghahramani. Probabilistic graphical models for semi-supervised traffic classification. In The 6th International Wireless Communications and Mobile Computing Conference, pages 752-757, Caen, France, 2010.

Abstract: Traffic classification using machine learning continues to be an active research area. The majority of work in this area uses off-the-shelf machine learning tools and treats them as black-box classifiers. This approach turns all the modelling complexity into a feature selection problem. In this paper, we build a problem-specific solution to the traffic classification problem by designing a custom probabilistic graphical model. Graphical models are a modular framework to design classifiers which incorporate domain-specific knowledge. More specifically, our solution introduces semi-supervised learning which means we learn from both labelled and unlabelled traffic flows. We show that our solution performs competitively compared to previous approaches while using less data and simpler features.

O. Stegle, K. J. Denby, E. J. Cooke, D. L. Wild, Z. Ghahramani, and K. M. Borgwardt. A robust Bayesian two-sample test for detecting intervals of differential gene expression in microarray time series. Journal of Computational Biology, 17(3):1-13, 2010, doi 10.1089/cmb.2009.0175.

Abstract: Understanding the regulatory mechanisms that are responsible for an organism's response to environmental change is an important issue in molecular biology. A first and important step towards this goal is to detect genes whose expression levels are affected by altered external conditions. A range of methods to test for differential gene expression, both in static as well as in time-course experiments, have been proposed. While these tests answer the question whether a gene is differentially expressed, they do not explicitly address the question when a gene is differentially expressed, although this information may provide insights into the course and causal structure of regulatory programs. In this article, we propose a twosample test for identifying intervals of differential gene expression in microarray time series. Our approach is based on Gaussian process regression, can deal with arbitrary numbers of replicates, and is robust with respect to outliers. We apply our algorithm to study the response of Arabidopsis thaliana genes to an infection by a fungal pathogen using a microarray time series dataset covering 30,336 gene probes at 24 observed time points. In classification experiments, our test compares favorably with existing methods and provides additional insights into time-dependent differential expression.

R. S. Savage, Z. Ghahramani, J. E. Griffin, B. de la Cruz, and D. L. Wild. Discovering transcriptional modules by Bayesian data integration. Bioinformatics, 26:i158-i167, 2010.

Abstract: Motivation: We present a method for directly inferring transcriptional modules (TMs) by integrating gene expression and transcription factor binding (ChIP-chip) data. Our model extends a hierarchical Dirichlet process mixture model to allow data fusion on a gene-by-gene basis. This encodes the intuition that co-expression and co-regulation are not necessarily equivalent and hence we do not expect all genes to group similarly in both datasets. In particular, it allows us to identify the subset of genes that share the same structure of transcriptional modules in both datasets.
Results: We find that by working on a gene-by-gene basis, our model is able to extract clusters with greater functional coherence than existing methods. By combining gene expression and transcription factor binding (ChIP-chip) data in this way, we are better able to determine the groups of genes that are most likely to represent underlying TMs.
Availability: If interested in the code for the work presented in this article, please contact the authors.

R. Silva, K. A. Heller, Z. Ghahramani, and E. M. Airoldi. Ranking relations using analogies in biological and information networks. Annals of Applied Statistics, 4(2):615-644, 2010.

Abstract: Analogical reasoning depends fundamentally on the ability to learn and generalize about relations between objects. We develop an approach to relational learning which, given a set of pairs of objects S = A(1):B(1), A(2):B(2), ..., A(N):B(N), measures how well other pairs A:B fit in with the set S. Our work addresses the question: is the relation between objects A and B analogous to those relations found in S? Such questions are particularly relevant in information retrieval, where an investigator might want to search for analogous pairs of objects that match the query set of interest. There are many ways in which objects can be related, making the task of measuring analogies very challenging. Our approach combines a similarity measure on function spaces with Bayesian analysis to produce a ranking. It requires data containing features of the objects of interest and a link matrix specifying which relationships exist; no further attributes of such relationships are necessary. We illustrate the potential of our method on text analysis and information networks. An application on discovering functional interactions between pairs of proteins is discussed in detail, where we show that our approach can work in practice even if a small set of protein pairs is provided.

Andreas Vlachos, Zoubin Ghahramani, and Ted Briscoe. Active learning for constrained Dirichlet process mixture models. In Proceedings of the 2010 Workshop on Geometrical Models of Natural Language Semantics, pages 57-61, Uppsala, Sweden, 2010.

Abstract: Recent work applied Dirichlet Process Mixture Models to the task of verb clustering, incorporating supervision in the form of must-links and cannot-links constraints between instances. In this work, we introduce an active learning approach for constraint selection employing uncertainty-based sampling. We achieve substantial improvements over random selection on two datasets.

Andrew Gordon Wilson and Zoubin Ghahramani. Copula processes. In Advances in Neural Information Processing Systems 23, 2010. Spotlight.

Abstract: We define a copula process which describes the dependencies between arbitrarily many random variables independently of their marginal distributions. As an example, we develop a stochastic volatility model, Gaussian Copula Process Volatility (GCPV), to predict the latent standard deviations of a sequence of random variables. To make predictions we use Bayesian inference, with the Laplace approximation, and with Markov chain Monte Carlo as an alternative. We find our model can outperform GARCH on simulated and financial data. And unlike GARCH, GCPV can easily handle missing data, incorporate covariates other than time, and model a rich class of covariance structures.

Comment: Supplementary Material, slides.

Finale Doshi-Velez, David Knowles, Shakir Mohamed, and Zoubin Ghahramani. Large scale non-parametric inference: Data parallelisation in the Indian buffet process. In Advances in Neural Information Processing Systems 23, pages 1294-1302, Cambridge, MA, USA, December 2009. The MIT Press.

Abstract: Nonparametric Bayesian models provide a framework for flexible probabilistic modelling of complex datasets. Unfortunately, the high-dimensional averages required for Bayesian methods can be slow, especially with the unbounded representations used by nonparametric models. We address the challenge of scaling Bayesian inference to the increasingly large datasets found in real-world applications. We focus on parallelisation of inference in the Indian Buffet Process (IBP), which allows data points to have an unbounded number of sparse latent features. Our novel MCMC sampler divides a large data set between multiple processors and uses message passing to compute the global likelihoods and posteriors. This algorithm, the first parallel inference scheme for IBP-based models, scales to datasets orders of magnitude larger than have previously been possible.

Shakir Mohamed, Katherine A. Heller, and Zoubin Ghahramani. Bayesian exponential family PCA. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 1089-1096, Cambridge, MA, USA, December 2009. The MIT Press.

Abstract: Principal Components Analysis (PCA) has become established as one of the key tools for dimensionality reduction when dealing with real valued data. Approaches such as exponential family PCA and non-negative matrix factorisation have successfully extended PCA to non-Gaussian data types, but these techniques fail to take advantage of Bayesian inference and can suffer from problems of overfitting and poor generalisation. This paper presents a fully probabilistic approach to PCA, which is generalised to the exponential family, based on Hybrid Monte Carlo sampling. We describe the model which is based on a factorisation of the observed data matrix, and show performance of the model on both synthetic and real data.

Comment: spotlight.

O. Stegle, K. Denby, S. McHattie, A. Meade, D. Wild, Z. Ghahramani, and K Borgwardt. Discovering temporal patterns of differential gene expression in microarray time series. In German Conference on Bioinformatics, pages 133-142, Halle, Germany, September 2009.

Abstract: A wealth of time series of microarray measurements have become available over recent years. Several two-sample tests for detecting differential gene expression in these time series have been defined, but they can only answer the question whether a gene is differentially expressed across the whole time series, not in which intervals it is differentially expressed. In this article, we propose a Gaussian process based approach for studying these dynamics of differential gene expression. In experiments on Arabidopsis thaliana gene expression levels, our novel technique helps us to uncover that the family of WRKY transcription factors appears to be involved in the early response to infection by a fungal pathogen.

R. Savage, K. A. Heller, Y. Xu, Zoubin Ghahramani, W. Truman, M. Grant, K. Denby, and D. L. Wild. R/BHC: fast Bayesian hierarchical clustering for microarray data. BMC Bioinformatics 2009, 10(242):1-9, August 2009, doi 10.1186/1471-2105-10-242.

Abstract: Background: Although the use of clustering methods has rapidly become one of the standard computational approaches in the literature of microarray gene expression data analysis, little attention has been paid to uncertainty in the results obtained.
Results: We present an R/Bioconductor port of a fast novel algorithm for Bayesian agglomerative hierarchical clustering and demonstrate its use in clustering gene expression microarray data. The method performs bottom-up hierarchical clustering, using a Dirichlet Process (infinite mixture) to model uncertainty in the data and Bayesian model selection to decide at each step which clusters to merge.
Conclusion: Biologically plausible results are presented from a well studied data set: expression profiles of A. thaliana subjected to a variety of biotic and abiotic stresses. Our method avoids several limitations of traditional methods, for example how many clusters there should be and how to choose a principled distance metric.

J. Van Gael, A. Vlachos, and Z. Ghahramani. The infinite HMM for unsupervised PoS tagging. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 678-687, Singapore, August 2009. Association for Computational Linguistics.

Abstract: We extend previous work on fully unsupervised part-of-speech tagging. Using a non-parametric version of the HMM, called the infinite HMM (iHMM), we address the problem of choosing the number of hidden states in unsupervised Markov models for PoS tagging. We experiment with two non-parametric priors, the Dirichlet and Pitman-Yor processes, on the Wall Street Journal dataset using a parallelized implementation of an iHMM inference algorithm. We evaluate the results with a variety of clustering evaluation metrics and achieve equivalent or better performances than previously reported. Building on this promising result we evaluate the output of the unsupervised PoS tagger as a direct replacement for the output of a fully supervised PoS tagger for the task of shallow parsing and compare the two evaluations.

R. Adams and Zoubin Ghahramani. Archipelago: nonparametric Bayesian semi-supervised learning. In Léon Bottou and Michael Littman, editors, 26th International Conference on Machine Learning, pages 1-8, Montréal, QC, Canada, June 2009. Omnipress.

Abstract: Semi-supervised learning (SSL), is classification where additional unlabeled data can be used to improve accuracy. Generative approaches are appealing in this situation, as a model of the data's probability density can assist in identifying clusters. Nonparametric Bayesian methods, while ideal in theory due to their principled motivations, have been difficult to apply to SSL in practice. We present a nonparametric Bayesian method that uses Gaussian processes for the generative model, avoiding many of the problems associated with Dirichlet process mixture models. Our model is fully generative and we take advantage of recent advances in Markov chain Monte Carlo algorithms to provide a practical inference method. Our method compares favorably to competing approaches on synthetic and real-world multi-class data.

Comment: This paper was awarded Honourable Mention for Best Paper at ICML 2009.

F. Doshi-Velez and Z. Ghahramani. Correlated non-parametric latent feature models. In Conference on Uncertainty in Artificial Intelligence (UAI 2009), pages 143-150, Montréal, QC, Canada, June 2009. AUAI Press.

Abstract: We are often interested in explaining data through a set of hidden factors or features. To allow for an unknown number of such hidden features, one can use the IBP: a non-parametric latent feature model that does not bound the number of active features in a dataset. However, the IBP assumes that all latent features are uncorrelated, making it inadequate for many real-world problems. We introduce a framework for correlated non-parametric feature models, generalising the IBP. We use this framework to generate several specific models and demonstrate applications on real-world datasets.

Finale Doshi-Velez and Zoubin Ghahramani. Accelerated Gibbs sampling for the Indian buffet process. In Léon Bottou and Michael Littman, editors, 26th International Conference on Machine Learning, pages 273-280, Montréal, QC, Canada, June 2009. Omnipress.

Abstract: We often seek to identify co-occurring hidden features in a set of observations. The Indian Buffet Process (IBP) provides a non-parametric prior on the features present in each observation, but current inference techniques for the IBP often scale poorly. The collapsed Gibbs sampler for the IBP has a running time cubic in the number of observations, and the uncollapsed Gibbs sampler, while linear, is often slow to mix. We present a new linear-time collapsed Gibbs sampler for conjugate likelihood models and demonstrate its efficacy on large real-world datasets.

R. Silva and Z. Ghahramani. The hidden life of latent variables: Bayesian learning with mixed graph models. Journal of Machine Learning Research, 10:1187-1238, June 2009.

Abstract: Directed acyclic graphs (DAGs) have been widely used as a representation of conditional independence in machine learning and statistics. Moreover, hidden or latent variables are often an important component of graphical models. However, DAG models suffer from an important limitation: the family of DAGs is not closed under marginalization of hidden variables. This means that in general we cannot use a DAG to represent the independencies over a subset of variables in a larger DAG. Directed mixed graphs (DMGs) are a representation that includes DAGs as a special case, and overcomes this limitation. This paper introduces algorithms for performing Bayesian inference in Gaussian and probit DMG models. An important requirement for inference is the specification of the distribution over parameters of the models. We introduce a new distribution for covariance matrices of Gaussian DMGs. We discuss and illustrate how several Bayesian machine learning tasks can benefit from the principle presented here: the power to model dependencies that are generated from hidden variables, but without necessarily modeling such variables explicitly.

W. Chu and Z. Ghahramani. Probabilistic models for incomplete multi-dimensional arrays. In D. van Dyk and M. Welling, editors, 12th International Conference on Artificial Intelligence and Statistics, volume 5, pages 89-96, Clearwater Beach, FL, USA, April 2009. Microtome Publishing (paper) Journal of Machine Learning Research. ISSN 1938-7228.

Abstract: In multiway data, each sample is measured by multiple sets of correlated attributes. We develop a probabilistic framework for modeling structural dependency from partially observed multi-dimensional array data, known as pTucker. Latent components associated with individual array dimensions are jointly retrieved while the core tensor is integrated out. The resulting algorithm is capable of handling large-scale data sets. We verify the usefulness of this approach by comparing against classical models on applications to modeling amino acid fluorescence, collaborative filtering and a number of benchmark multiway array data.

Frederik Eaton and Zoubin Ghahramani. Choosing a variable to clamp: Approximate inference using conditioned belief propagation. In D. van Dyk and M. Welling, editors, 12th International Conference on Artificial Intelligence and Statistics, volume 5, pages 145-152, Clearwater Beach, FL, USA, April 2009. Journal of Machine Learning Research.

Abstract: In this paper we propose an algorithm for approximate inference on graphical models based on belief propagation (BP). Our algorithm is an approximate version of Cutset Conditioning, in which a subset of variables is instantiated to make the rest of the graph singly connected. We relax the constraint of single-connectedness, and select variables one at a time for conditioning, running belief propagation after each selection. We consider the problem of determining the best variable to clamp at each level of recursion, and propose a fast heuristic which applies back-propagation to the BP updates. We demonstrate that the heuristic performs better than selecting variables at random, and give experimental results which show that it performs competitively with existing approximate inference algorithms.

Comment: Code (in C++ based on libDAI).

C. Lippert, O. Stegle, Z. Ghahramani, and K. Borgwardt. A kernel method for unsupervised structured network inference. In D. van Dyk and M. Welling, editors, 12th International Conference on Artificial Intelligence and Statistics, volume 5, pages 368-375, Clearwater Beach, FL, USA, April 2009. Journal of Machine Learning Research. ISSN: 1938-7228.

Abstract: Network inference is the problem of inferring edges between a set of real-world objects, for instance, interactions between pairs of proteins in bioinformatics. Current kernel-based approaches to this problem share a set of common features: (i) they are supervised and hence require labeled training data; (ii) edges in the network are treated as mutually independent and hence topological properties are largely ignored; (iii) they lack a statistical interpretation. We argue that these common assumptions are often undesirable for network inference, and propose (i) an unsupervised kernel method (ii) that takes the global structure of the network into account and (iii) is statistically motivated. We show that our approach can explain commonly used heuristics in statistical terms. In experiments on social networks, dfferent variants of our method demonstrate appealing predictive performance.

R. Silva and Z. Ghahramani. Factorial mixture of Gaussians and the marginal independence model. In 12th International Conference on Artificial Intelligence and Statistics, volume 5, pages 520-527, Clearwater Beach, FL, USA, April 2009. Journal of Machine Learning Research. ISSN: 1938-7228.

Abstract: Marginal independence constraints play an important role in learning with graphical models. One way of parameterizing a model of marginal independencies is by building a latent variable model where two independent observed variables have no common latent source. In sparse domains, however, it might be advantageous to model the marginal observed distribution directly, without explicitly including latent variables in the model. There have been recent advances in Gaussian and binary models of marginal independence, but no models with non-linear dependencies between continuous variables has been proposed so far. In this paper, we describe how to generalize the Gaussian model of marginal independencies based on mixtures, and how to learn parameters. This requires a non-standard parameterization and raises difficult non-linear optimization issues.

Comment: Code at http://www.homepages.ucl.ac.uk/~ucgtrbd/code/fmog-version0.zip

T. Stepleton, Z. Ghahramani, G. Gordon, and T.-S. Lee. The block diagonal infinite hidden Markov model. In D. van Dyk and M. Welling, editors, 12th International Conference on Artificial Intelligence and Statistics, volume 5, pages 552-559, Clearwater Beach, FL, USA, April 2009. Microtome Publishing (paper) Journal of Machine Learning Research. ISSN 1938-7228.

Abstract: The Infinite Hidden Markov Model (IHMM) extends hidden Markov models to have a countably infinite number of hidden states (Beal et al., 2002; Teh et al., 2006). We present a generalization of this framework that introduces nearly block-diagonal structure in the transitions between the hidden states, where blocks correspond to "sub-behaviors" exhibited by data sequences. In identifying such structure, the model classifies, or partitions, sequence data according to these sub-behaviors in an unsupervised way. We present an application of this model to artificial data, a video gesture classification task, and a musical theme labeling task, and show that components of the model can also be applied to graph segmentation.

Yang Xu, Katherine A. Heller, and Zoubin Ghahramani. Tree-based inference for Dirichlet process mixtures. In D. van Dyk and M. Welling, editors, 12th International Conference on Artificial Intelligence and Statistics, volume 5, pages 623-630, Clearwater Beach, FL, USA, April 2009. Microtome Publishing (paper), Journal of Machine Learning Research (online). ISSN 1938-7228.

Abstract: The Dirichlet process mixture (DPM) is a widely used model for clustering and for general nonparametric Bayesian density estimation. Unfortunately, like in many statistical models, exact inference in a DPM is intractable, and approximate methods are needed to perform efficient inference. While most attention in the literature has been placed on Markov chain Monte Carlo (MCMC) [1, 2, 3], variational Bayesian (VB) [4] and collapsed variational methods [5], [6] recently introduced a novel class of approximation for DPMs based on Bayesian hierarchical clustering (BHC). These tree-based combinatorial approximations efficiently sum over exponentially many ways of partitioning the data and offer a novel lower bound on the marginal likelihood of the DPM [6]. In this paper we make the following contributions: (1) We show empirically that the BHC lower bounds are substantially tighter than the bounds given by VB [4] and by collapsed variational methods [5] on synthetic and real datasets. (2) We also show that BHC offers a more accurate predictive performance on these datasets. (3) We further improve the tree-based lower bounds with an algorithm that efficiently sums contributions from alternative trees. (4) We present a fast approximate method for BHC. Our results suggest that our combinatorial approximate inference methods and lower bounds may be useful not only in DPMs but in other models as well.

A. Vlachos, A Korhonen, and Z. Ghahramani. Unsupervised and constrained Dirichlet process mixture models for verb clustering. In 4th Workshop on Statistical Machine Translation, EACL '09, Athens, Greece, March 2009.

Abstract: In this work, we apply Dirichlet Process Mixture Models (DPMMs) to a learning task in natural language processing (NLP): lexical-semantic verb clustering. We thoroughly evaluate a method of guiding DPMMs towards a particular clustering solution using pairwise constraints. The quantitative and qualitative evaluation performed highlights the benefits of both standard and constrained DPMMs compared to previously used approaches. In addition, it sheds light on the use of evaluation measures and their practical application.

Karsten M. Borgwardt and Zoubin Ghahramani. Bayesian two-sample tests. arXiv, abs/0906.4032, 2009.

Abstract: In this paper, we present two classes of Bayesian approaches to the two-sample problem. Our first class of methods extends the Bayesian t-test to include all parametric models in the exponential family and their conjugate priors. Our second class of methods uses Dirichlet process mixtures (DPM) of such conjugate-exponential distributions as flexible nonparametric priors over the unknown distributions.

Finale Doshi-Velez and Zoubin Ghahramani. Accelerated sampling for the Indian buffet process. In Andrea Pohoreckyj Danyluk, Léon Bottou, and Michael L. Littman, editors, ICML, volume 382 of ACM International Conference Proceeding Series, page 35. ACM, 2009.

Abstract: We often seek to identify co-occurring hidden features in a set of observations. The Indian Buffet Process (IBP) provides a nonparametric prior on the features present in each observation, but current inference techniques for the IBP often scale poorly. The collapsed Gibbs sampler for the IBP has a running time cubic in the number of observations, and the uncollapsed Gibbs sampler, while linear, is often slow to mix. We present a new linear-time collapsed Gibbs sampler for conjugate likelihood models and demonstrate its efficacy on large real-world datasets.

Carl Edward Rasmussen, Bernhard J. de la Cruz, Zoubin Ghahramani, and David L. Wild. Modeling and visualizing uncertainty in gene expression clusters using Dirichlet process mixtures. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 6(4):615-628, 2009, doi 10.1109/TCBB.2007.70269.

Abstract: Although the use of clustering methods has rapidly become one of the standard computational approaches in the literature of microarray gene expression data, little attention has been paid to uncertainty in the results obtained. Dirichlet process mixture (DPM) models provide a nonparametric Bayesian alternative to the bootstrap approach to modeling uncertainty in gene expression clustering. Most previously published applications of Bayesian model-based clustering methods have been to short time series data. In this paper, we present a case study of the application of nonparametric Bayesian clustering methods to the clustering of high-dimensional nontime series gene expression data using full Gaussian covariances. We use the probability that two genes belong to the same cluster in a DPM model as a measure of the similarity of these gene expression profiles. Conversely, this probability can be used to define a dissimilarity measure, which, for the purposes of visualization, can be input to one of the standard linkage algorithms used for hierarchical clustering. Biologically plausible results are obtained from the Rosetta compendium of expression profiles which extend previously published cluster analyses of this data.

O. Stegle, K. Denby, David L. Wild, Zoubin Ghahramani, and Karsten Borgwardt. A robust Bayesian two-sample test for detecting intervals of differential gene expression in microarray time series. In 13th Annual International Conference on Research in Computational Molecular Biology (RECOMB 2009), volume 5541 of Lecture Notes in Bioinformatics, pages 201-216, Tucson, AZ, USA, 2009. Springer-Verlag, doi 10.1007/978-3-642-02008-7_14.

Abstract: Understanding the regulatory mechanisms that are responsible for an organism's response to environmental changes is an important question in molecular biology. A first and important step towards this goal is to detect genes whose expression levels are affected by altered external conditions. A range of methods to test for differential gene expression, both in static as well as in time-course experiments, have been proposed. While these tests answer the question whether a gene is differentially expressed, they do not explicitly address the question when a gene is differentially expressed, although this information may provide insights into the course and causal structure of regulatory programs. In this article, we propose a two-sample test for identifying intervals of differential gene expression in microarray time series. Our approach is based on Gaussian process regression, can deal with arbitrary numbers of replicates and is robust with respect to outliers. We apply our algorithm to study the response of Arabidopsis thaliana genes to an infection by a fungal pathogen using a microarray time series dataset covering 30,336 gene probes at 24 time points. In classification experiments our test compares favorably with existing methods and provides additional insights into time-dependent differential expression.

Andreas Vlachos, Anna Korhonen, and Zoubin Ghahramani. Unsupervised and constrained Dirichlet process mixture models for verb clustering. In Proceedings of the workshop on geometrical models of natural language semantics, pages 74-82. Association for Computational Linguistics, 2009.

Abstract: In this work, we apply Dirichlet Process Mixture Models (DPMMs) to a learning task in natural language processing (NLP): lexical-semantic verb clustering. We thoroughly evaluate a method of guiding DP- MMs towards a particular clustering solution using pairwise constraints. The quantitative and qualitative evaluation per- formed highlights the benefits of both standard and constrained DPMMs com- pared to previously used approaches. In addition, it sheds light on the use of evaluation measures and their practical application.

C. Hübler, K. Borgwardt, H.-P. Kriegel, and Z. Ghahramani. Metropolis algorithms for representative subgraph sampling. In Proceedings of 8th IEEE International Conference on Data Mining (ICDM 2008), pages 283-292, Pisa, Italy, December 2008. IEEE. ISSN: 1550-4786.

Abstract: While data mining in chemoinformatics studied graph data with dozens of nodes, systems biology and the Internet are now generating graph data with thousands and millions of nodes. Hence data mining faces the algorithmic challenge of coping with this significant increase in graph size: Classic algorithms for data analysis are often too expensive and too slow on large graphs.
While one strategy to overcome this problem is to design novel efficient algorithms, the other is to 'reduce' the size of the large graph by sampling. This is the scope of this paper: We will present novel Metropolis algorithms for sampling a 'representative' small subgraph from the original large graph, with 'representative' describing the requirement that the sample shall preserve crucial graph properties of the original graph. In our experiments, we improve over the pioneering work of Leskovec and Faloutsos (KDD 2006), by producing representative subgraph samples that are both smaller and of higher quality than those produced by other methods from the literature.

H. Kim and Zoubin Ghahramani. Outlier robust Gaussian process classification. In L. Niels da Vitoria, editor, Structural, Syntactic and Statistical Pattern Recognition, volume 5342 of Lecture Notes in Computer Science (LNCS), pages 896-905, Berlin, Germany, December 2008. Springer Berlin / Heidelberg.

Abstract: Gaussian process classifiers (GPCs) are a fully statistical model for kernel classification. We present a form of GPC which is robust to labeling errors in the data set. This model allows label noise not only near the class boundaries, but also far from the class boundaries which can result from mistakes in labelling or gross errors in measuring the input features. We derive an outlier robust algorithm for training this model which alternates iterations based on the EP approximation and hyperparameter updates until convergence. We show the usefulness of the proposed algorithm with model selection method through simulation results.

R. Silva, W. Chu, and Zoubin Ghahramani. Hidden common cause relations in relational learning. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1345-1352, Cambridge, MA, USA, December 2008. The MIT Press.

Abstract: When predicting class labels for objects within a relational database, it is often helpful to consider a model for relationships: this allows for information between class labels to be shared and to improve prediction performance. However, there are different ways by which objects can be related within a relational database. One traditional way corresponds to a Markov network structure: each existing relation is represented by an undirected edge. This encodes that, conditioned on input features, each object label is independent of other object labels given its neighbors in the graph. However, there is no reason why Markov networks should be the only representation of choice for symmetric dependence structures. Here we discuss the case when relationships are postulated to exist due to hidden common causes. We discuss how the resulting graphical model differs from Markov networks, and how it describes different types of real-world relational processes. A Bayesian nonparametric classification model is built upon this graphical representation and evaluated with several empirical studies.

Comment: Code at http://www.homepages.ucl.ac.uk/~ucgtrbd/code/xgp

J.M. Sung, Z. Ghahramani, and S.Y. Bang. Second-order latent space variational Bayes for approximate Bayesian inference. IEEE Signal Processing Letters, 15:918-921, December 2008.

Abstract: In this letter, we consider a variational approximate Bayesian inference framework, latent-space variational Bayes (LSVB), in the general context of conjugate-exponential family models with latent variables. In the LSVB approach, we integrate out model parameters in an exact way and then perform the variational inference over only the latent variables. It can be shown that LSVB can achieve better estimates of the model evidence as well as the distribution over the latent variables than the popular variational Bayesian expectation-maximization (VBEM). However, the distribution over the latent variables in LSVB has to be approximated in practice. As an approximate implementation of LSVB, we propose a second-order LSVB (SoLSVB) method. In particular, VBEM can be derived as a special case of a first-order approximation in LSVB. SoLSVB can capture higher order statistics neglected in VBEM and can therefore achieve a better approximation. Examples of Gaussian mixture models are used to illustrate the comparison between our method and VBEM, demonstrating the improvement.

J. Van Gael, Y.W. Teh, and Z. Ghahramani. The infinite factorial hidden Markov model. In D. Koller, D. Schuurmans, L. Bottou, and Y. Bengio, editors, Advances in Neural Information Processing Systems 21, volume 21, pages 1697-1704, Cambridge, MA, USA, December 2008. The MIT Press.

Abstract: The infinite factorial hidden Markov model is a non-parametric extension of the factorial hidden Markov model. Our model defines a probability distribution over an infinite number of independent binary hidden Markov chains which together produce an observable sequence of random variables. Central to our model is a new type of non-parametric prior distribution inspired by the Indian Buffet Process which we call the Indian Buffet Markov Process.

J. Zhang, Z. Ghahramani, and Y. Yang. Flexible latent variable models for multi-task learning. Machine Learning, 73(3):221-242, December 2008.

Abstract: Given multiple prediction problems such as regression and classification, we are interested in a joint inference framework which can effectively borrow information among tasks to improve the prediction accuracy, especially when the number of training examples per problem is small. In this paper we propose a probabilistic framework which can support a set of latent variable models for different multi-task learning scenarios. We show that the framework is a generalization of standard learning methods for single prediction problems and it can effectively model the shared structure among different prediction tasks. Furthermore, we present efficient algorithms for the empirical Bayes method as well as point estimation. Our experiments on both simulated datasets and real world classification datasets show the effectiveness of the proposed models in two evaluation settings: standard multi-task learning setting and transfer learning setting.

J.M. Sung, Z. Ghahramani, and S.Y. Bang. Latent space variational Bayes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(12):2236-2242, November 2008.

Abstract: Variational Bayesian Expectation-Maximization (VBEM), an approximate inference method for probabilistic models based on factorizing over latent variables and model parameters, has been a standard technique for practical Bayesian inference. In this paper, we introduce a more general approximate inference framework for conjugate-exponential family models, which we call Latent-Space Variational Bayes (LSVB). In this approach, we integrate out the model parameters in an exact way, leaving only the latent variables. It can be shown that the LSVB approach gives better estimates of the model evidence as well as the distribution over the latent variables than the VBEM approach, but, in practice, the distribution over the latent variables has to be approximated. As a practical implementation, we present a First-order LSVB (FoLSVB) algorithm to approximate the distribution over the latent variables. From this approximate distribution, one can also estimate the model evidence and the posterior over the model parameters. The FoLSVB algorithm is directly comparable to the VBEM algorithm and has the same computational complexity. We discuss how LSVB generalizes the recently proposed collapsed variational methods to general conjugate-exponential families. Examples based on mixtures of Gaussians and mixtures of Bernoullis with synthetic and real-world data sets are used to illustrate some advantages of our method over VBEM.

Katherine A. Heller, Sinead Williamson, and Zoubin Ghahramani. Statistical models for partial membership. In Andrew McCallum and Sam Roweis, editors, 25th International Conference on Machine Learning, pages 392-399, Helsinki, Finland, July 2008. Omnipress.

Abstract: We present a principled Bayesian framework for modeling partial memberships of data points to clusters. Unlike a standard mixture model which assumes that each data point belongs to one and only one mixture component, or cluster, a partial membership model allows data points to have fractional membership in multiple clusters. Algorithms which assign data points partial memberships to clusters can be useful for tasks such as clustering genes based on microarray data (Gasch & Eisen, 2002). Our Bayesian Partial Membership Model (BPM) uses exponential family distributions to model each cluster, and a product of these distibtutions, with weighted parameters, to model each datapoint. Here the weights correspond to the degree to which the datapoint belongs to each cluster. All parameters in the BPM are continuous, so we can use Hybrid Monte Carlo to perform inference and learning. We discuss relationships between the BPM and Latent Dirichlet Allocation, Mixed Membership models, Exponential Family PCA, and fuzzy clustering. Lastly, we show some experimental results and discuss nonparametric extensions to our model.

A. Vlachos, Z. Ghahramani, and A Korhonen. Dirichlet process mixture models for verb clustering. In Guillaume Bouchard, Hal Daumé III, Marc Dymetman, and Yee Whye Teh, editors, ICML Workshop on Prior Knowledge for Text and Language Processing, pages 43-48, Helsinki, Finland, July 2008.

Abstract: In this work we apply Dirichlet Process Mixture Models to a learning task in natural language processing (NLP): lexical-semantic verb clustering. We assess the performance on a dataset based on Levin's (1993) verb classes using the recently introduced V-measure metric. In, we present a method to add human supervision to the model in order to to influence the solution with respect to some prior knowledge. The quantitative evaluation performed highlights the benefits of the chosen method compared to previously used clustering approaches.

Hyun-Chul Kim and Zoubin Ghahramani. Outlier robust Gaussian process classification. In Niels da Vitoria Lobo, Takis Kasparis, Fabio Roli, James Tin-Yau Kwok, Michael Georgiopoulos, Georgios C. Anagnostopoulos, and Marco Loog, editors, SSPR/SPR, volume 5342 of Lecture Notes in Computer Science, pages 896-905. Springer, 2008.

Abstract: Gaussian process classifiers (GPCs) are a fully statistical model for kernel classification. We present a form of GPC which is robust to labeling errors in the data set. This model allows label noise not only near the class boundaries, but also far from the class boundaries which can result from mistakes in labelling or gross errors in measuring the input features. We derive an outlier robust algorithm for training this model which alternates iterations based on the EP approximation and hyperparameter updates until convergence. We show the usefulness of the proposed algorithm with model selection method through simulation results.

Andreas Vlachos, Zoubin Ghahramani, and Anna Korhonen. Dirichlet process mixture models for verb clustering. In Proceedings of the ICML workshop on Prior Knowledge for Text and Language, 2008.

Abstract: In this work we apply Dirichlet Process Mixture Models to a learning task in natural language processing (NLP): lexical-semantic verb clustering. We assess the performance on a dataset based on Levin’s (1993) verb classes using the recently introduced V- measure metric. In, we present a method to add human supervision to the model in order to to influence the solution with respect to some prior knowledge. The quantitative evaluation performed highlights the benefits of the chosen method compared to previously used clustering approaches.

Jurgen Van Gael, Yunus Saatçi, Yee-Whye Teh, and Zoubin Ghahramani. Beam sampling for the infinite hidden Markov model. In 25th International Conference on Machine Learning, volume 25, pages 1088-1095, Helsinki, Finland, 2008. ACM.

Abstract: The infinite hidden Markov model is a non-parametric extension of the widely used hidden Markov model. Our paper introduces a new inference algorithm for the infinite Hidden Markov model called beam sampling. Beam sampling combines slice sampling, which limits the number of states considered at each time step to a finite number, with dynamic programming, which samples whole state trajectories efficiently. Our algorithm typically outperforms the Gibbs sampler and is more robust. We present applications of iHMM inference using the beam sampler on changepoint detection and text prediction problems.

Sinead Williamson and Zoubin Ghahramani. Probabilistic models for data combination in recommender systems. In Learning from Multiple Sources Workshop, NIPS Conference, Whistler Canada, 2008.

W. Chu, V. Sindhwani, Z. Ghahramani, and S. Keerthi. Relational learning with Gaussian processes. In B. Schölkopf, J. Platt, and T. Hofmann, editors, Advances in Neural Information Processing Systems 19, volume 19 of Bradford Books, pages 289-296, Cambridge, MA, USA, September 2007. The MIT Press. Online contents gives pages 314-321, and 289-296 on pdf of contents.

Abstract: Correlation between instances is often modelled via a kernel function using input attributes of the instances. Relational knowledge can further reveal additional pairwise correlations between variables of interest. In this paper, we develop a class of models which incorporates both reciprocal relational information and input attributes using Gaussian process techniques. This approach provides a novel non-parametric Bayesian framework with a data-dependent prior for supervised learning tasks. We also apply this framework to semi-supervised learning. Experimental results on several real world data sets verify the usefulness of this algorithm.

David Knowles and Zoubin Ghahramani. Infinite sparse factor analysis and infinite independent components analysis. In 7th International Conference on Independent Component Analysis and Signal Separation, pages 381-388, London, UK, September 2007. Springer, doi 10.1007/978-3-540-74494-8_48.

Abstract: A nonparametric Bayesian extension of Independent Components Analysis (ICA) is proposed where observed data Y is modelled as a linear superposition, G, of a potentially infinite number of hidden sources, X. Whether a given source is active for a specific data point is specified by an infinite binary matrix, Z. The resulting sparse representation allows increased data reduction compared to standard ICA. We define a prior on Z using the Indian Buffet Process (IBP). We describe four variants of the model, with Gaussian or Laplacian priors on X and the one or two-parameter IBPs. We demonstrate Bayesian inference under these models using a Markov chain Monte Carlo (MCMC) algorithm on synthetic and gene expression data and compare to standard ICA algorithms.

E. Meeds, Z. Ghahramani, R. Neal, and S.T. Roweis. Modelling dyadic data with binary latent factors. In B. Schölkopf, J. Platt, and T. Hofmann, editors, Advances in Neural Information Processing Systems 19, Bradford Books, pages 977-984, Cambridge, MA, USA, September 2007. The MIT Press. Online contents gives pages 1002-1009, and 977-984 on pdf contents.

Abstract: We introduce binary matrix factorization, a novel model for unsupervised matrix decomposition. The decomposition is learned by fitting a non-parametric Bayesian probabilistic model with binary latent variables to a matrix of dyadic data. Unlike bi-clustering models, which assign each row or column to a single cluster based on a categorical hidden feature, our binary feature model reflects the prior belief that items and attributes can be associated with more than one latent cluster at a time. We provide simple learning and inference rules for this new model and show how to extend it to an infinite model in which the number of features is not a priori fixed but is allowed to grow with the size of the data.

F. Pérez-Cruz, Zoubin Ghahramani, and M. Pontil. Conditional graphical models. In G. H. Bakir, T. Hofmann, B. Schölkopf, A. J. Smola, B. Taskar, and S. V. N. Vishwanathan, editors, Predicting Structured Data, pages 265-282. The MIT Press, Cambridge, MA, USA, September 2007. Chapter 12.

Abstract: In this chapter we propose a modification of CRF-like algorithms that allows for solving large-scale structured classification problems. Our approach consists in upper bounding the CRF functional in order to decompose its training into independent optimisation problems per clique. Furthermore we show that each sub-problem corresponds to solving a multiclass learning task in each clique, which enlarges the applicability of these tools for large-scale structural learning problems. Before presenting the Conditional Graphical Model (CGM), as we refer to this procedure, we review the family of CRF algorithms. We concentrate on the best known procedures and standard generalisations of CRFs. The ob jective of this introduction is analysing from the same viewpoint the proposed solutions in the literature to tackle this problem, which allows comparing their different features. We complete the chapter with a case study, in which we show the possibility to work with large-scale problems using CGM and that the obtained performance is comparable to the result with CRF-like algorithms.

Z. Ghahramani, T.L. Griffiths, and P. Sollich. Bayesian nonparametric latent feature models (with discussion). In J.M. Bernardo, M.J. Bayarri, J.O. Berger, A.P. Dawid, D. Heckerman, A.F.M. Smith, and M. West, editors, Bayesian Statistics 8, pages 201-226, Oxford, UK, July 2007. Oxford University Press.

Abstract: We describe a flexible nonparametric approach to latent variable modelling in which the number of latent variables is unbounded. This approach is based on a probability distribution over equivalence classes of binary matrices with a finite number of rows, corresponding to the data points, and an unbounded number of columns, corresponding to the latent variables. Each data point can be associated with a subset of the possible latent variables, which we refer to as the latent features of that data point. The binary variables in the matrix indicate which latent feature is possessed by which data point, and there is a potentially infinite array of features. We derive the distribution over unbounded binary matrices by taking the limit of a distribution over N×K binary matrices as K→∞. We define a simple generative processes for this distribution which we call the Indian buffet process (IBP; Griffiths and Ghahramani, 2005, 2006) by analogy to the Chinese restaurant process (Aldous, 1985; Pitman, 2002). The IBP has a single hyperparameter which controls both the number of feature per ob ject and the total number of features. We describe a two-parameter generalization of the IBP which has additional flexibility, independently controlling the number of features per object and the total number of features in the matrix. The use of this distribution as a prior in an infinite latent feature model is illustrated, and Markov chain Monte Carlo algorithms for inference are described.

Comment: Includes discussion by David Dunson, and rejoinder.

Katherine A. Heller and Zoubin Ghahramani. A nonparametric Bayesian approach to modeling overlapping clusters. In Marina Meila and Xiaotong Shen, editors, AISTATS, volume 2 of JMLR Proceedings, pages 187-194. JMLR.org, 2007.

Abstract: Although clustering data into mutually exclusive partitions has been an extremely successful approach to unsupervised learning, there are many situations in which a richer model is needed to fully represent the data. This is the case in problems where data points actually simultaneously belong to multiple, overlapping clusters. For example a particular gene may have several functions, therefore belonging to several distinct clusters of genes, and a biologist may want to discover these through unsupervised modeling of gene expression data. We present a new nonparametric Bayesian method, the Infinite Overlapping Mixture Model (IOMM), for modeling overlapping clusters. The IOMM uses exponential family distributions to model each cluster and forms an overlapping mixture by taking products of such distributions, much like products of experts (Hinton, 2002). The IOMM allows an unbounded number of clusters, and assignments of points to (multiple) clusters is modeled using an Indian Buffet Process (IBP), (Griffiths and Ghahramani, 2006). The IOMM has the desirable properties of being able to focus in on overlapping regions while maintaining the ability to model a potentially infinite number of clusters which may overlap. We derive MCMC inference algorithms for the IOMM and show that these can be used to cluster movies into multiple genres.

Edward Snelson and Zoubin Ghahramani. Local and global sparse Gaussian process approximations. In M. Meila and X. Shen, editors, 11th International Conference on Artificial Intelligence and Statistics. Omnipress, 2007.

Abstract: Gaussian process (GP) models are flexible probabilistic nonparametric models for regression, classification and other tasks. Unfortunately they suffer from computational intractability for large data sets. Over the past decade there have been many different approximations developed to reduce this cost. Most of these can be termed global approximations, in that they try to summarize all the training data via a small set of support points. A different approach is that of local regression, where many local experts account for their own part of space. In this paper we start by investigating the regimes in which these different approaches work well or fail. We then proceed to develop a new sparse GP approximation which is a combination of both the global and local approaches. Theoretically we show that it is derived as a natural extension of the framework developed by Quiñonero-Candela and Rasmussen for sparse GP approximations. We demonstrate the benefits of the combined approximation on some 1D examples for illustration, and on some large real-world data sets.

Ricardo Silva, Katherine A. Heller, and Zoubin Ghahramani. Analogical reasoning with relational Bayesian sets. In Marina Meila and Xiaotong Shen, editors, AISTATS, volume 2 of JMLR Proceedings, pages 500-507. JMLR.org, 2007.

Abstract: Analogical reasoning depends fundamentally on the ability to learn and generalize about relations between objects. There are many ways in which objects can be related, making automated analogical reasoning very chal- lenging. Here we develop an approach which, given a set of pairs of related objects S = A1:B1,A2:B2,...,AN:BN, measures how well other pairs A:B fit in with the set S. This addresses the question: is the relation between objects A and B analogous to those relations found in S? We recast this classi- cal problem as a problem of Bayesian analy- sis of relational data. This problem is non- trivial because direct similarity between ob- jects is not a good way of measuring analo- gies. For instance, the analogy between an electron around the nucleus of an atom and a planet around the Sun is hardly justified by isolated, non-relational, comparisons of an electron to a planet, and a nucleus to the Sun. We develop a generative model for predicting the existence of relationships and extend the framework of Ghahramani and Heller (2005) to provide a Bayesian measure for how analogous a relation is to other relations. This sheds new light on an old problem, which we motivate and illustrate through practical applications in exploratory data analysis.

Yee Whye Teh, Dilan Görür, and Zoubin Ghahramani. Stick-breaking construction for the Indian buffet process. In Marina Meila and Xiaotong Shen, editors, AISTATS, volume 2 of JMLR Proceedings, pages 556-563. JMLR.org, 2007.

Abstract: The Indian buffet process (IBP) is a Bayesian nonparametric distribution whereby objects are modelled using an unbounded number of latent features. In this paper we derive a stick-breaking representation for the IBP. Based on this new representation, we develop slice samplers for the IBP that are efficient, easy to implement and are more generally applicable than the currently available Gibbs sampler. This representation, along with the work of Thibaux and Jordan, also illuminates interesting theoretical connections between the IBP, Chinese restaurant processes, Beta processes and Dirichlet processes.

T. L. Griffiths and Z. Ghahramani. Infinite latent feature models and the Indian Buffet Process. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, pages 475-482, Cambridge, MA, USA, December 2006. The MIT Press.

Abstract: We define a probability distribution over equivalence classes of binary matrices with a finite number of rows and an unbounded number of columns. This distribution is suitable for use as a prior in probabilistic models that represent objects using a potentially infinite array of features. We identify a simple generative process that results in the same distribution over equivalence classes, which we call the Indian buffet process. We illustrate the use of this distribution as a prior in an infinite latent feature model, deriving a Markov chain Monte Carlo algorithm for inference in this model and applying the algorithm to an image dataset.

Zoubin Ghahramani and Katherine A. Heller. Bayesian sets. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, pages 435-442, Cambridge, MA, USA, December 2006. The MIT Press.

Abstract: Inspired by "Google™ Sets", we consider the problem of retrieving items from a concept or cluster, given a query consisting of a few items from that cluster. We formulate this as a Bayesian inference problem and describe a very simple algorithm for solving it. Our algorithm uses a model-based concept of a cluster and ranks items using a score which evaluates the marginal probability that each item belongs to a cluster containing the query items. For exponential family models with conjugate priors this marginal probability is a simple function of sufficient statistics. We focus on sparse binary data and show that our score can be evaluated exactly using a single sparse matrix multiplication, making it possible to apply our algorithm to very large datasets. We evaluate our algorithm on three datasets: retrieving movies from EachMovie, finding completions of author sets from the NIPS dataset, and finding completions of sets of words appearing in the Grolier encyclopedia. We compare to Google™ Sets and show that Bayesian Sets gives very reasonable set completions.

Arik Azran and Zoubin Ghahramani. A new approach to data driven clustering. In William Cohen and Andrew Moore, editors, 23rd International Conference on Machine Learning, pages 57-64, Pittsburgh, PA, USA, June 2006. Omnipress.

Abstract: We consider the problem of clustering in its most basic form where only a local metric on the data space is given. No parametric statistical model is assumed, and the number of clusters is learned from the data. We introduce, analyze and demonstrate a novel approach to clustering where data points are viewed as nodes of a graph, and pairwise similarities are used to derive a transition probability matrix P for a Markov random walk between them. The algorithm automatically reveals structure at increasing scales by varying the number of steps taken by this random walk. Points are represented as rows of Pt, which are the t-step distributions of the walk starting at that point; these distributions are then clustered using a KL-minimizing iterative algorithm. Both the number of clusters, and the number of steps that best reveal it, are found by optimizing spectral properties of P.

Arik Azran and Zoubin Ghahramani. Spectral methods for automatic multiscale data clustering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 190-197, New York, NY, USA, June 2006. IEEE Computer Society, doi 10.1109/CVPR.2006.289.

Abstract: Spectral clustering is a simple yet powerful method for finding structure in data using spectral properties of an associated pairwise similarity matrix. This paper provides new insights into how the method works and uses these to derive new algorithms which given the data alone automatically learn different plausible data partitionings. The main theoretical contribution is a generalization of a key result in the field, the multicut lemma (Meila 2001). We use this generalization to derive two algorithms. The first uses the eigenvalues of a given affinity matrix to infer the number of clusters in data, and the second combines learning the affinity matrix with inferring the number of clusters. A hierarchical implementation of the algorithms is also derived. The algorithms are theoretically motivated and demonstrated on nontrivial data sets.

Wei Chu, Zoubin Ghahramani, Roland Krause, and David L. Wild. Identifying protein complexes in high-throughput protein interaction screens using an infinite latent feature model. In Russ B. Altman, Tiffany Murray, Teri E. Klein, A. Keith Dunker, and Lawrence Hunter, editors, Pacific Symposium on Biocomputing, pages 231-242. World Scientific, 2006.

Abstract: We propose a Bayesian approach to identify protein complexes and their constituents from high-throughput protein-protein interaction screens. An infinite latent feature model that allows for multi-complex membership by individual proteins is coupled with a graph diffusion kernel that evaluates the likelihood of two proteins belonging to the same complex. Gibbs sampling is then used to infer a catalog of protein complexes from the interaction screen data. An advantage of this model is that it places no prior constraints on the number of complexes and automatically infers the number of significant complexes from the data. Validation results using affinity purification/mass spectrometry experimental data from yeast RNA-processing complexes indicate that our method is capable of partitioning the data in a biologically meaningful way. A supplementary web site containing larger versions of the figures is available at http://public.kgi.edu/wild/PSBO6/index.html.

Wei Chu, Zoubin Ghahramani, Alexei A. Podtelezhnikov, and David L. Wild. Bayesian segmental models with multiple sequence alignment profiles for protein secondary structure and contact map prediction. IEEE/ACM Trans. Comput. Biology Bioinform., 3(2):98-113, 2006.

Abstract: In this paper, we develop a segmental semi-Markov model (SSMM) for protein secondary structure prediction which incorporates multiple sequence alignment profiles with the purpose of improving the predictive performance. The segmental model is a generalization of the hidden Markov model where a hidden state generates segments of various length and secondary structure type. A novel parameterized model is proposed for the likelihood function that explicitly represents multiple sequence alignment profiles to capture the segmental conformation. Numerical results on benchmark data sets show that incorporating the profiles results in substantial improvements and the generalization performance is promising. By incorporating the information from long range interactions in beta-sheets, this model is also capable of carrying out inference on contact maps. This is an important advantage of probabilistic generative models over the traditional discriminative approach to protein secondary structure prediction. The Web server of our algorithm and supplementary materials are available at http://public.kgi.edu/-wild/bsm.html.

Katherine A. Heller and Zoubin Ghahramani. A simple Bayesian framework for content-based image retrieval. In CVPR, pages 2110-2117. IEEE Computer Society, 2006.

Abstract: We present a Bayesian framework for content-based image retrieval which models the distribution of color and texture features within sets of related images. Given a userspecified text query (e.g. "penguins") the system first extracts a set of images, from a labelled corpus, corresponding to that query. The distribution over features of these images is used to compute a Bayesian score for each image in a large unlabelled corpus. Unlabelled images are then ranked using this score and the top images are returned. Although the Bayesian score is based on computing marginal likelihoods, which integrate over model parameters, in the case of sparse binary data the score reduces to a single matrix-vector multiplication and is therefore extremely efficient to compute. We show that our method works surprisingly well despite its simplicity and the fact that no relevance feedback is used. We compare different choices of features, and evaluate our results using human subjects.

Hyun-Chul Kim and Zoubin Ghahramani. Bayesian Gaussian process classification with the EM-EP algorithm. IEEE Trans. Pattern Anal. Mach. Intell., 28(12):1948-1959, 2006.

Abstract: Gaussian process classifiers (GPCs) are Bayesian probabilistic kernel classifiers. In GPCs, the probability of belonging to a certain class at an input location is monotonically related to the value of some latent function at that location. Starting from a Gaussian process prior over this latent function, data are used to infer both the posterior over the latent function and the values of hyperparameters to determine various aspects of the function. Recently, the expectation propagation (EP) approach has been proposed to infer the posterior over the latent function. Based on this work, we present an approximate EM algorithm, the EM-EP algorithm, to learn both the latent function and the hyperparameters. This algorithm is found to converge in practice and provides an efficient Bayesian framework for learning hyperparameters of the kernel. A multiclass extension of the EM-EP algorithm for GPCs is also derived. In the experimental results, the EM-EP algorithms are as good or better than other methods for GPCs or Support Vector Machines (SVMs) with cross-validation.

Hyun-Chul Kim, Daijin Kim, Zoubin Ghahramani, and Sung Yang Bang. Appearance-based gender classification with Gaussian processes. Pattern Recognition Letters, 27(6):618-626, 2006.

Abstract: This paper concerns the gender classification task of discriminating between images of faces of men and women from face images. In appearance-based approaches, the initial images are preprocessed (e.g. normalized) and input into classifiers. Recently, support vector machines (SVMs) which are popular kernel classifiers have been applied to gender classification and have shown excellent performance. SVMs have difficulty in determining the hyperparameters in kernels (using cross-validation). We propose to use Gaussian process classifiers (GPCs) which are Bayesian kernel classifiers. The main advantage of GPCs over SVMs is that they determine the hyperparameters of the kernel based on Bayesian model selection criterion. The experimental results show that our methods outperformed SVMs with cross-validation in most of data sets. Moreover, the kernel hyperparameters found by GPCs using Bayesian methods can be used to improve SVM performance.

Hyun-Chul Kim, Daijin Kim, Zoubin Ghahramani, and Sung Yang Bang. Gender classification with Bayesian kernel methods. In William W. Cohen and Andrew Moore, editors, IJCNN, volume 148 of ACM International Conference Proceeding Series, pages 3371-3376. ACM, 2006.

Abstract: We consider the gender classification task of discriminating between images of faces of men and women from face images. In appearance-based approaches, the initial images are preprocessed (e.g. normalized) and input into classifiers. Recently, SVMs which are popular kernel classifiers have been applied to gender classification and have shown excellent performance. We propose to use one of Bayesian kernel methods which is Gaussian process classifiers (GPCs) for gender classification. The main advantage of Bayesian kernel methods such as GPCs over SVMs is that they determine the hyperparameters of the kernel based on Bayesian model selection criterion. Our results show that GPCs outperformed SVMs with cross validation.

Iain Murray, Zoubin Ghahramani, and David J. C. MacKay. MCMC for doubly-intractable distributions. In UAI. AUAI Press, 2006.

Abstract: Markov Chain Monte Carlo (MCMC) algorithms are routinely used to draw samples from distributions with intractable normalization constants. However, standard MCMC algorithms do not apply to doubly-intractable distributions in which there are additional parameter-dependent normalization terms; for example, the posterior over parameters of an undirected graphical model. An ingenious auxiliary-variable scheme (Møller et al., 2004) offers a solution: exact sampling (Propp and Wilson, 1996) is used to sample from a Metropolis-Hastings proposal for which the acceptance probability is tractable. Unfortunately the acceptance probability of these expensive updates can be low. This paper provides a generalization of Møller et al. (2004) and a new MCMC algorithm, which obtains better acceptance probabilities for the same amount of exact sampling, and removes the need to estimate model parameters before sampling begins.

Ricardo Silva and Zoubin Ghahramani. Bayesian inference for Gaussian mixed graph models. In UAI. AUAI Press, 2006.

Abstract: We introduce priors and algorithms to perform Bayesian inference in Gaussian models defined by acyclic directed mixed graphs. Such a class of graphs, composed of directed and bi-directed edges, is a representation of conditional independencies that is closed under marginalization and arises naturally from causal models which allow for unmeasured confounding. Monte Carlo methods and a variational approximation for such models are presented. Our algorithms for Bayesian inference allow the evaluation of posterior distributions for several quantities of interest, including causal effects that are not identifiable from data alone but could otherwise be inferred where informative prior knowledge about confounding is available.

Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, pages 1257-1264. The MIT Press, Cambridge, MA, 2006.

Abstract: We present a new Gaussian process (GP) regression model whose covariance is parameterized by the the locations of M pseudo-input points, which we learn by a gradient based optimization. We take M<<N, where N is the number of real data points, and hence obtain a sparse regression method which has O(NM2) training cost and O(M2) prediction cost per test case. We also find hyperparameters of the covariance function in the same joint optimization. The method can be viewed as a Bayesian regression model with particular input dependent noise. The method turns out to be closely related to several other sparse GP approaches, and we discuss the relation in detail. We finally demonstrate its performance on some large data sets, and make a direct comparison to other sparse GP methods. We show that our method can match full GP performance with small M, i.e. very sparse solutions, and it significantly outperforms other approaches in this regime.

Edward Snelson and Zoubin Ghahramani. Variable noise and dimensionality reduction for sparse Gaussian processes. In R. Dechter and T. S. Richardson, editors, 22nd Conference on Uncertainty in Artificial Intelligence. AUAI Press, 2006.

Abstract: The sparse pseudo-input Gaussian process (SPGP) is a new approximation method for speeding up GP regression in the case of a large number of data points N. The approximation is controlled by the gradient optimization of a small set of M pseudo-inputs, thereby reducing complexity from O(N3) to O(NM2). One limitation of the SPGP is that this optimization space becomes impractically big for high dimensional data sets. This paper addresses this limitation by performing automatic dimensionality reduction. A projection of the input space to a low dimensional space is learned in a supervised manner, alongside the pseudo-inputs, which now live in this reduced space. The paper also investigates the suitability of the SPGP for modeling data with input-dependent noise. A further extension of the model is made to make it even more powerful in this regard - we learn an uncertainty parameter for each pseudo-input. The combination of sparsity, reduced dimension, and input-dependent noise makes it possible to apply GPs to much larger and more complex data sets than was previously practical. We demonstrate the benefits of these methods on several synthetic and real world problems.

Frank Wood, Thomas L. Griffiths, and Zoubin Ghahramani. A non-parametric Bayesian method for inferring hidden causes. In UAI. AUAI Press, 2006.

Abstract: We present a non-parametric Bayesian approach to structure learning with hidden causes. Previous Bayesian treatments of this problem define a prior over the number of hidden causes and use algorithms such as reversible jump Markov chain Monte Carlo to move between solutions. In contrast, we assume that the number of hidden causes is unbounded, but only a finite number influence observable variables. This makes it possible to use a Gibbs sampler to approximate the distribution over causal structures. We evaluate the performance of both approaches in discovering hidden causes in simulated data, and use our non-parametric approach to discover hidden causes in a real medical dataset.

Edward Snelson and Zoubin Ghahramani. Compact approximations to Bayesian predictive distributions. In 22nd International Conference on Machine Learning, Bonn, Germany, August 2005. Omnipress.

Abstract: We provide a general framework for learning precise, compact, and fast representations of the Bayesian predictive distribution for a model. This framework is based on minimizing the KL divergence between the true predictive density and a suitable compact approximation. We consider various methods for doing this, both sampling based approximations, and deterministic approximations such as expectation propagation. These methods are tested on a mixture of Gaussians model for density estimation and on binary linear classification, with both synthetic data sets for visualization and several real data sets. Our results show significant reductions in prediction time and memory footprint.

Matthew J. Beal, Francesco Falciani, Zoubin Ghahramani, Claudia Rangel, and David L. Wild. A Bayesian approach to reconstructing genetic regulatory networks with hidden factors. Bioinformatics, 21(3):349-356, 2005.

Abstract: Motivation: We have used state-space models (SSMs) to reverse engineer transcriptional networks from highly replicated gene expression profiling time series data obtained from a well-established model of T cell activation. SSMs are a class of dynamic Bayesian networks in which the observed measurements depend on some hidden state variables that evolve according to Markovian dynamics. These hidden variables can capture effects that cannot be directly measured in a gene expression profiling experiment, for example: genes that have not been included in the microarray, levels of regulatory proteins, the effects of mRNA and protein degradation, etc. Results: We have approached the problem of inferring the model structure of these state-space models using both classical and Bayesian methods. In our previous work, a bootstrap procedure was used to derive classical confidence intervals for parameters representing `gene-gene' interactions over time. In this article, variational approximations are used to perform the analogous model selection task in the Bayesian context. Certain interactions are present in both the classical and the Bayesian analyses of these regulatory networks. The resulting models place JunB and JunD at the centre of the mechanisms that control apoptosis and proliferation. These mechanisms are key for clonal expansion and for controlling the long term behavior (e.g. programmed cell death) of these cells.

Wei Chu and Zoubin Ghahramani. Gaussian processes for ordinal regression. Journal of Machine Learning Research, 6:1019-1041, 2005.

Abstract: We present a probabilistic kernel approach to ordinal regression based on Gaussian processes. A threshold model that generalizes the probit function is used as the likelihood function for ordinal variables. Two inference techniques, based on the Laplace approximation and the expectation propagation algorithm respectively, are derived for hyperparameter learning and model selection. We compare these two Gaussian process approaches with a previous ordinal regression method based on support vector machines on some benchmark and real-world data sets, including applications of ordinal regression to collaborative filtering and gene expression analysis. Experimental results on these data sets verify the usefulness of our approach.

Wei Chu and Zoubin Ghahramani. Preference learning with Gaussian processes. In Luc De Raedt and Stefan Wrobel, editors, ICML, volume 119 of ACM International Conference Proceeding Series, pages 137-144. ACM, 2005.

Abstract: In this paper, we propose a probabilistic kernel approach to preference learning based on Gaussian processes. A new likelihood function is proposed to capture the preference relations in the Bayesian framework. The generalized formulation is also applicable to tackle many multiclass problems. The overall approach has the advantages of Bayesian methods for model selection and probabilistic prediction. Experimental results compared against the constraint classification approach on several benchmark datasets verify the usefulness of this algorithm.

Wei Chu, Zoubin Ghahramani, Francesco Falciani, and David L. Wild. Biomarker discovery in microarray gene expression data with Gaussian processes. Bioinformatics, 21(16):3385-3393, 2005.

Abstract: MOTIVATION: In clinical practice, pathological phenotypes are often labelled with ordinal scales rather than binary, e.g. the Gleason grading system for tumour cell differentiation. However, in the literature of microarray analysis, these ordinal labels have been rarely treated in a principled way. This paper describes a gene selection algorithm based on Gaussian processes to discover consistent gene expression patterns associated with ordinal clinical phenotypes. The technique of automatic relevance determination is applied to represent the significance level of the genes in a Bayesian inference framework. RESULTS: The usefulness of the proposed algorithm for ordinal labels is demonstrated by the gene expression signature associated with the Gleason score for prostate cancer data. Our results demonstrate how multi-gene markers that may be initially developed with a diagnostic or prognostic application in mind are also useful as an investigative tool to reveal associations between specific molecular and cellular events and features of tumour physiology. Our algorithm can also be applied to microarray data with binary labels with results comparable to other methods in the literature.

Katherine A. Heller and Zoubin Ghahramani. Bayesian hierarchical clustering. In Luc De Raedt and Stefan Wrobel, editors, ICML, volume 119 of ACM International Conference Proceeding Series, pages 297-304. ACM, 2005.

Abstract: We present a novel algorithm for agglomerative hierarchical clustering based on evaluating marginal likelihoods of a probabilistic model. This algorithm has several advantages over traditional distance-based agglomerative clustering algorithms. (1) It defines a probabilistic model of the data which can be used to compute the predictive distribution of a test point and the probability of it belonging to any of the existing clusters in the tree. (2) It uses a model-based criterion to decide on merging clusters rather than an ad-hoc distance metric. (3) Bayesian hypothesis testing is used to decide which merges are advantageous and to output the recommended depth of the tree. (4) The algorithm can be interpreted as a novel fast bottom-up approximate inference method for a Dirichlet process (i.e. countably infinite) mixture model (DPM). It provides a new lower bound on the marginal likelihood of a DPM by summing over exponentially many clusterings of the data in polynomial time. We describe procedures for learning the model hyperpa-rameters, computing the predictive distribution, and extensions to the algorithm. Experimental results on synthetic and real-world data sets demonstrate useful properties of the algorithm.

Iain Murray, David J. C. MacKay, Zoubin Ghahramani, and John Skilling. Nested sampling for Potts models. In NIPS, 2005.

Abstract: Nested sampling is a new Monte Carlo method by Skilling intended for general Bayesian computation. Nested sampling provides a robust alternative to annealing-based methods for computing normalizing constants. It can also generate estimates of other quantities such as posterior expectations. The key technical requirement is an ability to draw samples uniformly from the prior subject to a constraint on the likelihood. We provide a demonstration with the Potts model, an undirected graphical model.

JaeMo Sung, Sung Yang Bang, Seungjin Choi, and Zoubin Ghahramani. U-likelihood and u-updating algorithms: Statistical inference in latent variable models. In João Gama, Rui Camacho, Pavel Brazdil, Alípio Jorge, and Luís Torgo, editors, ECML, volume 3720 of Lecture Notes in Computer Science, pages 377-388. Springer, 2005.

Abstract: In this paper we consider latent variable models and introduce a new U-likelihood concept for estimating the distribution over hidden variables. One can derive an estimate of parameters from this distribution. Our approach differs from the Bayesian and Maximum Likelihood (ML) approaches. It gives an alternative to Bayesian inference when we don't want to define a prior over parameters and gives an alternative to the ML method when we want a better estimate of the distribution over hidden variables. As a practical implementation, we present a U-updating algorithm based on the mean field theory to approximate the distribution over hidden variables from the U-likelihood. This algorithm captures some of the correlations among hidden variables by estimating reaction terms. Those reaction terms are found to penalize the likelihood. We show that the U-updating algorithm becomes the EM algorithm as a special case in the large sample limit. The useful behavior of our method is confirmed for the case of mixture of Gaussians by comparing to the EM algorithm.

Jian Zhang, Zoubin Ghahramani, and Yiming Yang. Learning multiple related tasks using latent independent component analysis. In NIPS, 2005.

Abstract: We propose a probabilistic model based on Independent Component Analysis for learning multiple related tasks. In our model the task parameters are assumed to be generated from independent sources which account for the relatedness of the tasks. We use Laplace distributions to model hidden sources which makes it possible to identify the hidden, independent components instead of just modeling correlations. Furthermore, our model enjoys a sparsity property which makes it both parsimonious and robust. We also propose efficient algorithms for both empirical Bayes method and point estimation. Our experimental results on two multi-label text classification data sets show that the proposed approach is promising.

Edward Snelson, Carl Edward Rasmussen, and Zoubin Ghahramani. Warped Gaussian processes. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16, pages 337-344, Cambridge, MA, USA, December 2004. The MIT Press.

Abstract: We generalise the Gaussian process (GP) framework for regression by learning a nonlinear transformation of the GP outputs. This allows for non-Gaussian processes and non-Gaussian noise. The learning algorithm chooses a nonlinear transformation such that transformed data is well-modelled by a GP. This can be seen as including a preprocessing transformation as an integral part of the probabilistic modelling problem, rather than as an ad-hoc step. We demonstrate on several real regression problems that learning the transformation can lead to significantly better performance than using a regular GP, or a GP with a fixed transformation.

Philip E. Bourne, C. K. J. Allerston, Werner G. Krebs, Wilfred W. Li, Ilya N. Shindyalov, Adam Godzik, Iddo Friedberg, Tong Liu, David L. Wild, Seungwoo Hwang, Zoubin Ghahramani, Li Chen, and John D. Westbrook. The status of structural genomics defined through the analysis of current targets and structures. In Russ B. Altman, A. Keith Dunker, Lawrence Hunter, Tiffany A. Jung, and Teri E. Klein, editors, Pacific Symposium on Biocomputing, pages 375-386. World Scientific, 2004.

Abstract: Structural genomics-large-scale macromolecular 3-dimenional structure determination-is unique in that major participants report scientific progress on a weekly basis. The target database (TargetDB) maintained by the Protein Data Bank (http://targetdb.pdb.org) reports this progress through the status of each protein sequence (target) under consideration by the major structural genomics centers worldwide. Hence, TargetDB provides a unique opportunity to analyze the potential impact that this major initiative provides to scientists interested in the sequence-structure-function-disease paradigm. Here we report such an analysis with a focus on: (i) temporal characteristics-how is the project doing and what can we expect in the future? (ii) target characteristics-what are the predicted functions of the proteins targeted by structural genomics and how biased is the target set when compared to the PDB and to predictions across complete genomes? (iii) structures solved-what are the characteristics of structures solved thus far and what do they contribute? The analysis required a more extensive database of structure predictions using different methods integrated with data from other sources. This database, associated tools and related data sources are available from http://spam.sdsc.edu.

Wei Chu, Zoubin Ghahramani, and David L. Wild. A graphical model for protein secondary structure prediction. In Carla E. Brodley, editor, ICML, volume 69 of ACM International Conference Proceeding Series. ACM, 2004.

Abstract: In this paper, we present a graphical model for protein secondary structure prediction. This model extends segmental semi-Markov models (SSMM) to exploit multiple sequence alignment profiles which contain information from evolutionarily related sequences. A novel parameterized model is proposed as the likelihood function for the SSMM to capture the segmental conformation. By incorporating the information from long range interactions in β-sheets, this model is capable of carrying out inference on contact maps. The numerical results on benchmark data sets show that incorporating the profiles results in substantial improvements and the generalization performance is promising.

Wei Chu, Zoubin Ghahramani, and David L. Wild. Protein secondary structure prediction using sigmoid belief networks to parameterize segmental semi-Markov models. In ESANN, pages 81-86, 2004.

Abstract: In this paper, we merge the parametric structure of neural networks into a segmental semi-Markov model to set up a Bayesian framework for protein structure prediction. The parametric model, which can also be regarded as an extension of a sigmoid belief network, captures the underlying dependency in residue sequences. The results of numerical experiments indicate the usefulness of this approach.

A. Dubey, S. Hwang, C. Rangel, Carl Edward Rasmussen, Zoubin Ghahramani, and David L. Wild. Clustering protein sequence and structure space with infinite Gaussian mixture models. In Pacific Symposium on Biocomputing 2004, pages 399-410, Singapore, 2004. World Scientific Publishing.

Abstract: We describe a novel approach to the problem of automatically clustering protein sequences and discovering protein families, subfamilies etc., based on the thoery of infinite Gaussian mixture models. This method allows the data itself to dictate how many mixture components are required to model it, and provides a measure of the probability that two proteins belong to the same cluster. We illustrate our methods with application to three data sets: globin sequences, globin sequences with known tree-dimensional structures and G-pretein coupled receptor sequences. The consistency of the clusters indicate that that our methods is producing biologically meaningful results, which provide a very good indication of the underlying families and subfamilies. With the inclusion of secondary structure and residue solvent accessibility information, we obtain a classification of sequences of known structure which reflects and extends their SCOP classifications.

Ananya Dubey, Seungwoo Hwang, Claudia Rangel, Carl Edward Rasmussen, Zoubin Ghahramani, and David L. Wild. Clustering protein sequence and structure space with infinite Gaussian mixture models. In Russ B. Altman, A. Keith Dunker, Lawrence Hunter, Tiffany A. Jung, and Teri E. Klein, editors, Pacific Symposium on Biocomputing, pages 399-410. World Scientific, 2004.

Abstract: We describe a novel approach to the problem of automatically clustering protein sequences and discovering protein families, subfamilies etc., based on the theory of infinite Gaussian mixtures models. This method allows the data itself to dictate how many mixture components are required to model it, and provides a measure of the probability that two proteins belong to the same cluster. We illustrate our methods with application to three data sets: globin sequences, globin sequences with known three-dimensional structures and G-protein coupled receptor sequences. The consistency of the clusters indicate that our method is producing biologically meaningful results, which provide a very good indication of the underlying families and subfamilies. With the inclusion of secondary structure and residue solvent accessibility information, we obtain a classification of sequences of known structure which both reflects and extends their SCOP classifications. A supplementray web site containing larger versions of the figures is available at http://public.kgi.edu/approximately wid/PSB04/index.html

Iain Murray and Zoubin Ghahramani. Bayesian learning in undirected graphical models: Approximate MCMC algorithms. In David Maxwell Chickering and Joseph Y. Halpern, editors, UAI, pages 392-399. AUAI Press, 2004.

Abstract: Bayesian learning in undirected graphical models|computing posterior distributions over parameters and predictive quantities is exceptionally difficult. We conjecture that for general undirected models, there are no tractable MCMC (Markov Chain Monte Carlo) schemes giving the correct equilibrium distribution over parameters. While this intractability, due to the partition function, is familiar to those performing parameter optimisation, Bayesian learning of posterior distributions over undirected model parameters has been unexplored and poses novel challenges. we propose several approximate MCMC schemes and test on fully observed binary models (Boltzmann machines) for a small coronary heart disease data set and larger artificial systems. While approximations must perform well on the model, their interaction with the sampling scheme is also important. Samplers based on variational mean- field approximations generally performed poorly, more advanced methods using loopy propagation, brief sampling and stochastic dynamics lead to acceptable parameter posteriors. Finally, we demonstrate these techniques on a Markov random field with hidden variables.

Yuan (Alan) Qi, Thomas P. Minka, Rosalind W. Picard, and Zoubin Ghahramani. Predictive automatic relevance determination by expectation propagation. In Carla E. Brodley, editor, ICML, volume 69 of ACM International Conference Proceeding Series. ACM, 2004.

Abstract: In many real-world classification problems the input contains a large number of potentially irrelevant features. This paper proposes a new Bayesian framework for determining the relevance of input features. This approach extends one of the most successful Bayesian methods for feature selection and sparse learning, known as Automatic Relevance Determination (ARD). ARD finds the relevance of features by optimizing the model marginal likelihood, also known as the evidence. We show that this can lead to overfitting. To address this problem, we propose Predictive ARD based on estimating the predictive performance of the classifier. While the actual leave-one-out predictive performance is generally very costly to compute, the expectation propagation (EP) algorithm proposed by Minka provides an estimate of this predictive performance as a side-effect of its iterations. We exploit this in our algorithm to do feature selection, and to select data points in a sparse Bayesian kernel classifier. Moreover, we provide two other improvements to previous algorithms, by replacing Laplace's approximation with the generally more accurate EP, and by incorporating the fast optimization algorithm proposed by Faul and Tipping. Our experiments show that our method based on the EP estimate of predictive performance is more accurate on test data than relevance determination by optimizing the evidence.

Claudia Rangel, John Angus, Zoubin Ghahramani, Maria Lioumi, Elizabeth Sotheran, Alessia Gaiba, David L. Wild, and Francesco Falciani. Modeling t-cell activation using gene expression profiling and state-space models. Bioinformatics, 20(9):1361-1372, 2004.

Abstract: Motivation: We have used state-space models to reverse engineer transcriptional networks from highly replicated gene expression profiling time series data obtained from a well-established model of T-cell activation. State space models are a class of dynamic Bayesian networks that assume that the observed measurements depend on some hidden state variables that evolve according to Markovian dynamics. These hidden variables can capture effects that cannot be measured in a gene expression profiling experiment, e.g. genes that have not been included in the microarray, levels of regulatory proteins, the effects of messenger RNA and protein degradation, etc. Results: Bootstrap confidence intervals are developed for parameters representing `gene-gene' interactions over time. Our models represent the dynamics of T-cell activation and provide a methodology for the development of rational and experimentally testable hypotheses. Availability: Supplementary data and Matlab computer source code will be made available on the web at the URL given below. Supplementary information: .

Sebastian Thrun, Yufeng Liu, Daphne Koller, Andrew Y. Ng, Zoubin Ghahramani, and Hugh F. Durrant-Whyte. Simultaneous localization and mapping with sparse extended information filters. I. J. Robotic Res., 23(7-8):693-716, 2004.

Abstract: This paper describes a scalable algorithm for the simultaneous mapping and localization (SLAM) problem. SLAM is the problem of acquiring a map of a static environment with a mobile robot. The vast majority of SLAM algorithms are based on the extended Kalman filter (EKF). This paper advocates an algorithm that relies on the dual of the EKF, the extended information filter (EIF). We show that when represented in the information form, map posteriors are dominated by a small number of links that tie together nearby features in the map. This insight is developed into a sparse variant of the EIF, called the sparse extended information filters (SEIF). SEIFs represent maps by graphical networks of features that are locally interconnected, where links represent relative information between pairs of nearby features, as well as information about the robot's pose relative to the map. We show that all essential update equations in SEIFs can be executed in constant time, irrespective of the size of the map. We also provide empirical results obtained for a benchmark data set collected in an outdoor environment, and using a multi-robot mapping simulation.

Jian Zhang, Zoubin Ghahramani, and Yiming Yang. A probabilistic model for online document clustering with application to novelty detection. In Sebastian Thrun, Lawrence K. Saul, and Bernhard Schölkopf, editors, NIPS. MIT Press, 2004.

Abstract: In this paper we propose a probabilistic model for online document clustering. We use non-parametric Dirichlet process prior to model the growing number of clusters, and use a prior of general English language model as the base distribution to handle the generation of novel clusters. Furthermore, cluster uncertainty is modeled with a Bayesian Dirichletmultinomial distribution. We use empirical Bayes method to estimate hyperparameters based on a historical dataset. Our probabilistic model is applied to the novelty detection task in Topic Detection and Tracking (TDT) and compared with existing approaches in the literature.

Xiaojin Zhu, Jaz S. Kandola, Zoubin Ghahramani, and John D. Lafferty. Nonparametric transforms of graph kernels for semi-supervised learning. In Sebastian Thrun, Lawrence K. Saul, and Bernhard Schölkopf, editors, NIPS. MIT Press, 2004.

Abstract: We present an algorithm based on convex optimization for constructing kernels for semi-supervised learning. The kernel matrices are derived from the spectral decomposition of graph Laplacians, and combine labeled and unlabeled data in a systematic fashion. Unlike previous work using diffusion kernels and Gaussian random field kernels, a nonparametric kernel approach is presented that incorporates order constraints during optimization. This results in flexible kernels and avoids the need to choose among different parametric forms. Our approach relies on a quadratically constrained quadratic program (QCQP), and is computationally feasible for large datasets. We evaluate the kernels on real datasets using support vector machines, with encouraging results.

Carl Edward Rasmussen and Zoubin Ghahramani. Bayesian Monte Carlo. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 489-496, Cambridge, MA, USA, December 2003. The MIT Press.

Abstract: We investigate Bayesian alternatives to classical Monte Carlo methods for evaluating integrals. Bayesian Monte Carlo (BMC) allows the incorporation of prior knowledge, such as smoothness of the integrand, into the estimation. In a simple problem we show that this outperforms any classical importance sampling method. We also attempt more challenging multidimensional integrals involved in computing marginal likelihoods of statistical models (a.k.a. partition functions and model evidences). We find that Bayesian Monte Carlo outperformed Annealed Importance Sampling, although for very high dimensional problems or problems with massive multimodality BMC may be less adequate. One advantage of the Bayesian approach to Monte Carlo is that samples can be drawn from any distribution. This allows for the possibility of active design of sample points so as to maximise information gain.

Zoubin Ghahramani. Unsupervised Learning. In Olivier Bousquet, Ulrike von Luxburg, and Gunnar Rätsch, editors, Advanced Lectures on Machine Learning, volume 3176 of Lecture Notes in Computer Science, pages 72-112. Springer, 2003.

Abstract: We give a tutorial and overview of the field of unsupervised learning from the perspective of statistical modelling. Unsupervised learning can be motivated from information theoretic and Bayesian principles. We briefly review basic models in unsupervised learning, including factor analysis, PCA, mixtures of Gaussians, ICA, hidden Markov models, state-space models, and many variants and extensions. We derive the EM algorithm and give an overview of fundamental concepts in graphical models, and inference algorithms on graphs. This is followed by a quick tour of approximate Bayesian inference, including Markov chain Monte Carlo (MCMC), Laplace approximation, BIC, variational approximations, and expectation propagation (EP). The aim of this chapter is to provide a high-level view of the field. Along the way, many state-of-the-art ideas and future directions are also reviewed.

Ruslan Salakhutdinov, Sam T. Roweis, and Zoubin Ghahramani. On the convergence of bound optimization algorithms. In Christopher Meek and Uffe Kjærulff, editors, UAI, pages 509-516. Morgan Kaufmann, 2003.

Abstract: Many practitioners who use EM and related algorithms complain that they are sometimes slow. When does this happen, and what can be done about it? In this paper, we study the general class of bound optimization algorithms - including EM, Iterative Scaling, Non-negative Matrix Factorization, CCCP - and their relationship to direct optimization algorithms such as gradientbased methods for parameter learning. We derive a general relationship between the updates performed by bound optimization methods and those of gradient and second-order methods and identify analytic conditions under which bound optimization algorithms exhibit quasi-Newton behavior, and under which they possess poor, first-order convergence. Based on this analysis, we consider several specific algorithms, interpret and analyze their convergence properties and provide some recipes for preprocessing input to these algorithms to yield faster convergence behavior. We report empirical results supporting our analysis and showing that simple data preprocessing can result in dramatically improved performance of bound optimizers in practice.

Ruslan Salakhutdinov, Sam T. Roweis, and Zoubin Ghahramani. Optimization with EM and expectation-conjugate-gradient. In Tom Fawcett and Nina Mishra, editors, ICML, pages 672-679. AAAI Press, 2003.

Abstract: We show a close relationship between the Expectation- Maximization (EM) algorithm and direct optimization algorithms such as gradientbased methods for parameter learning. We identify analytic conditions under which EM exhibits Newton-like behavior, and conditions under which it possesses poor, first-order convergence. Based on this analysis, we propose two novel algorithms for maximum likelihood estimation of latent variable models, and report empirical results showing that, as predicted by theory, the proposed new algorithms can substantially outperform standard EM in terms of speed of convergence in certain cases.

Xiaojin Zhu, Zoubin Ghahramani, and John D. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In Tom Fawcett and Nina Mishra, editors, ICML, pages 912-919. AAAI Press, 2003.

Abstract: An approach to semi-supervised learning is proposed that is based on a Gaussian random field model. Labeled and unlabeled data are represented as vertices in a weighted graph, with edge weights encoding the similarity between instances. The learning problem is then formulated in terms of a Gaussian random field on this graph, where the mean of the field is characterized in terms of harmonic functions, and is efficiently obtained using matrix methods or belief propagation. The resulting learning algorithms have intimate connections with random walks, electric networks, and spectral graph theory. We discuss methods to incorporate class priors and the predictions of classifiers obtained by supervised learning. We also propose a method of parameter learning by entropy minimization, and show the algorithm's ability to perform feature selection. Promising experimental results are presented for synthetic data, digit classification, and text classification tasks.

Matthew J. Beal, Zoubin Ghahramani, and Carl Edward Rasmussen. The infinite hidden Markov model. In Advances in Neural Information Processing Systems 14, pages 577-584, Cambridge, MA, USA, December 2002. The MIT Press.

Abstract: We show that it is possible to extend hidden Markov models to have a countably infinite number of hidden states. By using the theory of Dirichlet processes we can implicitly integrate out the infinitely many transition parameters, leaving only three hyperparameters which can be learned from data. These three hyperparameters define a hierarchical Dirichlet process capable of capturing a rich set of transition dynamics. The three hyperparameters control the time scale of the dynamics, the sparsity of the underlying state-transition matrix, and the expected number of distinct hidden states in a finite sequence. In this framework it is also natural to allow the alphabet of emitted symbols to be infinite - consider, for example, symbols being possible words appearing in English text.

Carl Edward Rasmussen and Zoubin Ghahramani. Infinite mixtures of Gaussian process experts. In Advances in Neural Information Processing Systems 14, pages 881-888, Cambridge, MA, USA, December 2002. The MIT Press.

Abstract: We present an extension to the Mixture of Experts (ME) model, where the individual experts are Gaussian Process (GP) regression models. Using an input-dependent adaptation of the Dirichlet Process, we implement a gating network for an infinite number of Experts. Inference in this model may be done efficiently using a Markov Chain relying on Gibbs sampling. The model allows the effective covariance function to vary with the inputs, and may handle large datasets - thus potentially overcoming two of the biggest hurdles with GP models. Simulations show the viability of this approach.

Rong Jin and Zoubin Ghahramani. Learning with multiple labels. In Suzanna Becker, Sebastian Thrun, and Klaus Obermayer, editors, NIPS, pages 897-904. MIT Press, 2002.

Abstract: In this paper, we study a special kind of learning problem in which each training instance is given a set of (or distribution over) candidate class labels and only one of the candidate labels is the correct one. Such a problem can occur, e.g., in an information retrieval setting where a set of words is associated with an image, or if classes labels are organized hierarchically. We propose a novel discriminative approach for handling the ambiguity of class labels in the training examples. The experiments with the proposed approach over five different UCI datasets show that our approach is able to find the correct label among the set of candidate labels and actually achieve performance close to the case when each training instance is given a single correct label. In contrast, naïve methods degrade rapidly as more ambiguity is introduced into the labels.

A. Raval, Zoubin Ghahramani, and David L. Wild. A Bayesian network model for protein fold and remote homologue recognition. Bioinformatics, 18(6):788-801, 2002.

Abstract: Motivation: The Bayesian network approach is a framework which combines graphical representation and probability theory, which includes, as a special case, hidden Markov models. Hidden Markov models trained on amino acid sequence or secondary structure data alone have been shown to have potential for addressing the problem of protein fold and superfamily classification. Results: This paper describes a novel implementation of a Bayesian network which simultaneously learns amino acid sequence, secondary structure and residue accessibility for proteins of known three-dimensional structure. An awareness of the errors inherent in predicted secondary structure may be incorporated into the model by means of a confusion matrix. Training and validation data have been derived for a number of protein superfamilies from the Structural Classification of Proteins (SCOP) database. Cross validation results using posterior probability classification demonstrate that the Bayesian network performs better in classifying proteins of known structural superfamily than a hidden Markov model trained on amino acid sequences alone.

Naonori Ueda and Zoubin Ghahramani. Bayesian model search for mixture models based on optimizing variational bounds. Neural Networks, 15(10):1223-1241, 2002.

Abstract: When learning a mixture model, we suffer from the local optima and model structure determination problems. In this paper, we present a method for simultaneously solving these problems based on the variational Bayesian (VB) framework. First, in the VB framework, we derive an objective function that can simultaneously optimize both model parameter distributions and model structure. Next, focusing on mixture models, we present a deterministic algorithm to approximately optimize the objective function by using the idea of the split and merge operations which we previously proposed within the maximum likelihood framework. Then, we apply the method to mixture of expers (MoE) models to experimentally show that the proposed method can find the optimal number of experts of a MoE while avoiding local maxima. q 2002 Elsevier Science Ltd. All rights reserved.

Carl Edward Rasmussen and Zoubin Ghahramani. Occam's razor. In Advances in Neural Information Processing Systems 13, pages 294-300, Cambridge, MA, USA, December 2001. The MIT Press.

Abstract: The Bayesian paradigm apparently only sometimes gives rise to Occam's Razor; at other times very large models perform well. We give simple examples of both kinds of behaviour. The two views are reconciled when measuring complexity of functions, rather than of the machinery used to implement them. We analyze the complexity of functions for some linear in the parameter models that are equivalent to Gaussian Processes, and always find Occam's Razor at work.

Zoubin Ghahramani. An introduction to hidden Markov models and Bayesian networks. IJPRAI, 15(1):9-42, 2001.

Abstract: We provide a tutorial on learning and inference in hidden Markov models in the context of the recent literature on Bayesian networks. This perspective make sit possible to consider novel generalizations to hidden Markov models with multiple hidden state variables, multiscale representations, and mixed discrete and continuous variables. Although exact inference in these generalizations is usually intractable, one can use approximate inference in these generalizations is usually intractable, one can use approximate inference algorithms such as Markov chain sampling and variational methods. We describe how such methods are applied to these generalized hidden Markov models. We conclude this review with a discussion of Bayesian methods for model selection in generalized HMMs.

Nicholas J. Adams, Amos J. Storkey, Christopher K. I. Williams, and Zoubin Ghahramani. MFDTs: Mean field dynamic trees. In International Conference on Pattern Recognition, volume 3, pages 151-154, 2000.

Abstract: Tree structured belief networks are attractive for image segmentation tasks. However, networks with fixed architectures are not very suitable as they lead to blocky artefacts, and led to the introduction of dynamic trees (DTs). The Dynamic trees architecture provide a prior distribution over tree structures, and simulated annealing (SA) was used to search for structures with high posterior probability. In this paper we introduce a mean field approach to inference in DTs. We find that the mean field method captures the posterior better than just using the maximum a posteriori solution found by SA

Zoubin Ghahramani and Matthew J. Beal. Propagation algorithms for variational Bayesian learning. In Michael J. Kearns, Sara A. Solla, and David A. Cohn, editors, NIPS, pages 507-513. The MIT Press, 2000.

Abstract: Variational approximations are becoming a widespread tool for Bayesian learning of graphical models. We provide some theoretical results for the variational updates in a very general family of conjugate-exponential graphical models. We show how the belief propagation and the junction tree algorithms can be used in the inference step of variational Bayesian learning. Applying these results to the Bayesian analysis of linear-Gaussian state-space models we obtain a learning procedure that exploits the Kalman smoothing propagation, while integrating over all model parameters. We demonstrate how this can be used to infer the hidden state dimensionality of the state-space model in a variety of synthetic problems and one real high-dimensional data set.

Zoubin Ghahramani and Geoffrey E. Hinton. Variational learning for switching state-space models. Neural Computation, 12(4):831-864, 2000.

Abstract: We introduce a new statistical model for time series which iteratively segments data into regimes with approximately linear dynamics and learns the parameters of each of these linear regimes. This model combines and generalizes two of the most widely used stochastic time series models-hidden Markov models and linear dynamical systems-and is closely related to models that are widely used in the control and econometrics literatures. It can also be derived by extending the mixture of experts neural network (Jacobs et al, 1991) to its fully dynamical version, in which both expert and gating networks are recurrent. Inferring the posterior probabilities of the hidden states of this model is computationally intractable, and therefore the exact Expectation Maximization (EM) algorithm cannot be applied. However, we present a variational approximation that maximizes a lower bound on the log likelihood and makes use of both the forward-backward recursions for hidden Markov models and the Kalman filter recursions for linear dynamical systems. We tested the algorithm both on artificial data sets and on a natural data set of respiration force from a patient with sleep apnea. The results suggest that variational approximations are a viable method for inference and learning in switching state-space models.

Carl Edward Rasmussen and Zoubin Ghahramani. Occam's razor. In Michael J. Kearns, Sara A. Solla, and David A. Cohn, editors, NIPS, pages 294-300. The MIT Press, 2000.

Abstract: The Bayesian paradigm apparently only sometimes gives rise to Occam's Razor; at other times very large models perform well. We give simple examples of both kinds of behaviour. The two views are reconciled when measuring complexity of functions, rather than of the machinery used to implement them. We analyze the complexity of functions for some linear in the parameter models that are equivalent to Gaussian Processes, and always find Occam's Razor at work.

Naonori Ueda, Ryohei Nakano, Zoubin Ghahramani, and Geoffrey E. Hinton. SMEM algorithm for mixture models. Neural Computation, 12(9):2109-2128, 2000.

Abstract: We present a split-and-merge expectation-maximization (SMEM) algorithm to overcome the local maxima problem in parameter estimation of finite mixture models. In the case of mixture models, local maxima often involve having too many components of a mixture model in one part of the space and too few in another, widely separated part of the space. To escape from such configurations, we repeatedly perform simultaneous split-and-merge operations using a new criterion for efficiently selecting the split-and-merge candidates. We apply the proposed algorithm to the training of gaussian mixtures and mixtures of factor analyzers using synthetic and real data and show the effectiveness of using the split-and-merge operations to improve the likelihood of both the training data and of held-out test data. We also show the practical usefulness of the proposed algorithm by applying it to image compression and pattern recognition problems.

Naonori Ueda, Ryohei Nakano, Zoubin Ghahramani, and Geoffrey E. Hinton. Split and merge EM algorithm for improving Gaussian mixture density estimates. VLSI Signal Processing, 26(1-2):133-140, 2000.

Abstract: We present a split and merge EM algorithm to overcome the local maximum problem in Gaussian mixture density estimation. Nonglobal maxims often involve having too many Gaussians in one part of the space and too few in another, widely separated part of the space. To escape from such configurations we repeatedly perform split and merge operations using a new criterion for efficiently selecting the split and merge candidates. Experimental results on synthetic and real data show the effectiveness of using the split and merge operations to improve the likelihood of both the training data and of held-out test data

Zoubin Ghahramani and Matthew J. Beal. Variational inference for Bayesian mixtures of factor analysers. In Michael J. Kearns, Sara A. Solla, and David A. Cohn, editors, NIPS, pages 449-455. The MIT Press, 1999.

Abstract: We present an algorithm that infers the model structure of a mixture of factor analysers using an efficient and deterministic variational approximation to full Bayesian integration over model parameters. This procedure can automatically determine the optimal number of components and the local dimensionality of each component (i.e. the number of factors in each factor analyser). Alternatively it can be used to infer posterior distributions over number of components and dimensionalities. Since all parameters are integrated out the method is not prone to overfitting. Using a stochastic procedure for adding components it is possible to perform the variational optimisation incrementally and to avoid local maxima. Results show that the method works very well in practice and correctly infers the number and dimensionality of nontrivial synthetic examples. By importance sampling from the variational approximation we show how to obtain unbiased estimates of the true evidence, the exact predictive density, and the KL divergence between the variational posterior and the true posterior, not only in this model but for variational approximations in general.

Zoubin Ghahramani, Alexander T Korenberg, and Geoffrey E Hinton. Scaling in a hierarchical unsupervised network. In Artificial Neural Networks, 1999. ICANN 99. Ninth International Conference on (Conf. Publ. No. 470), volume 1, pages 13-18. IET, 1999.

Abstract: A persistent worry with computational models of unsupervised learning is that learning will become more difficult as the problem is scaled. We examine this issue in the context of a novel hierarchical, generative model that can be viewed as a non-linear generalization of factor analysis and can be implemented in a neural network. The model performs perceptual inference in a probabilistically consistent manner by using top-down, bottom-up and lateral connections. These connections can be learned using simple rules that require only locally available information. We first demonstrate that the model can extract a sparse, distributed, hierarchical representation of global disparity from simplified random-dot stereograms. We then investigate some of the scaling properties of the algorithm on this problem and find that : (1) increasing the image size leads to faster and more reliable learning; (2) Increasing the depth of the network from one to two hidden layers leads to better representations at the first hidden layer, and (3) Once one part of the network has discovered how to represent disparity, it 'supervises' other parts of the network, greatly speeding up their learning.

Geoffrey E. Hinton, Zoubin Ghahramani, and Yee Whye Teh. Learning to parse images. In Michael J. Kearns, Sara A. Solla, and David A. Cohn, editors, NIPS, pages 463-469. The MIT Press, 1999.

Abstract: We describe a class of probabilistic models that we call credibility networks. Using parse trees as internal representations of images, credibility networks are able to perform segmentation and recognition simultaneously, removing the need for ad hoc segmentation heuristics. Promising results in the problem of segmenting handwritten digits were obtained.

Michael I. Jordan, Zoubin Ghahramani, Tommi Jaakkola, and Lawrence K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183-233, 1999.

Abstract: This paper presents a tutorial introduction to the use of variational methods for inference and learning in graphical models (Bayesian networks and Markov random fields). We present a number of examples of graphical models, including the QMR-DT database, the sigmoid belief network, the Boltzmann machine, and several variants of hidden Markov models, in which it is infeasible to run exact inference algorithms. We then introduce variational methods, which exploit laws of large numbers to transform the original graphical model into a simplified graphical model in which inference is efficient. Inference in the simpified model provides bounds on probabilities of interest in the original model. We describe a general framework for generating variational transformations based on convex duality. Finally we return to the examples and demonstrate how variational algorithms can be formulated in each case.

Sam T. Roweis and Zoubin Ghahramani. A unifying review of linear Gaussian models. Neural Computation, 11(2):305-345, 1999.

Abstract: Factor analysis, principal component analysis, mixtures of gaussian clusters, vector quantization, Kalman filter models, and hidden Markov models can all be unified as variations of unsupervised learning under a single basic generative model. This is achieved by collecting together disparate observations and derivations made by many previous authors and introducing a new way of linking discrete and continuous state models using a simple nonlinearity. Through the use of other nonlinearities, we show how independent component analysis is also a variation of the same basic generative model. We show that factor analysis and mixtures of gaussians can be implemented in autoencoder neural networks and learned using squared error plus the same regularization term. We introduce a new model for static data, known as sensible principal component analysis, as well as a novel concept of spatially adaptive observation noise. We also review some of the literature involving global and local mixtures of the basic models and provide pseudocode for inference and learning for all the basic models.

Zoubin Ghahramani and Sam T. Roweis. Learning nonlinear dynamical systems using an EM algorithm. In Michael J. Kearns, Sara A. Solla, and David A. Cohn, editors, NIPS, pages 431-437. The MIT Press, 1998.

Abstract: The Expectation Maximization (EM) algorithm is an iterative procedure for maximum likelihood parameter estimation from data sets with missing or hidden variables. It has been applied to system identification in linear stochastic state-space models, where the state variables are hidden from the observer and both the state and the parameters of the model have to be estimated simultaneously [9]. We present a generalization of the EM algorithm for parameter estimation in nonlinear dynamical systems. The ``expectation'' step makes use of Extended Kalman Smoothing to estimate the state, while the ``maximization'' step re-estimates the parameters using these uncertain state estimates. In general, the nonlinear maximization step is difficult because it requires integrating out the uncertainty in the states. However, if Gaussian radial basis function (RBF) approximators are used to model the nonlinearities, the integrals become tractable and the maximization step can be solved via systems of linear equations.

Naonori Ueda, Ryohei Nakano, Zoubin Ghahramani, and Geoffrey E. Hinton. SMEM algorithm for mixture models. In Michael J. Kearns, Sara A. Solla, and David A. Cohn, editors, NIPS, pages 599-605. The MIT Press, 1998.

Abstract: We present a split-and-merge expectation-maximization (SMEM) algorithm to overcome the local maxima problem in parameter estimation of finite mixture models. In the case of mixture models, local maxima often involve having too many components of a mixture model in one part of the space and too few in another, widely separated part of the space. To escape from such configurations, we repeatedly perform simultaneous split-and-merge operations using a new criterion for efficiently selecting the split-and-merge candidates. We apply the proposed algorithm to the training of gaussian mixtures and mixtures of factor analyzers using synthetic and real data and show the effectiveness of using the split- and-merge operations to improve the likelihood of both the training data and of held-out test data. We also show the practical usefulness of the proposed algorithm by applying it to image compression and pattern recognition problems.

Zoubin Ghahramani and Geoffrey E. Hinton. Hierarchical non-linear factor analysis and topographic maps. In Michael I. Jordan, Michael J. Kearns, and Sara A. Solla, editors, NIPS. The MIT Press, 1997.

Abstract: We first describe a hierarchical, generative model that can be viewed as a non-linear generalisation of factor analysis and can be implemented in a neural network. The model performs perceptual inference in a probabilistically consistent manner by using top-down, bottom-up and lateral connections. These connections can be learned using simple rules that require only locally available information. We then show how to incorporate lateral connections into the generative model. The model extracts a sparse, distributed, hierarchical representation of depth from simplified random-dot stereograms and the localised disparity detectors in the first hidden layer form a topographic map. When presented with image patches from natural scenes, the model develops topographically organised local feature detectors.

Zoubin Ghahramani. Learning dynamic Bayesian networks. In C. Lee Giles and Marco Gori, editors, Adaptive Processing of Sequences and Data Structures, volume 1387 of Lecture Notes in Computer Science, pages 168-197. Springer, 1997.

Abstract: Bayesian networks are a concise graphical formalism for describing probabilistic models. We have provided a brief tutorial of methods for learning and inference in dynamic Bayesian networks. In many of the interesting models, beyond the simple linear dynamical system or hidden Markov model, the calculations required for inference are intractable. Two different approaches for handling this intractability are Monte Carlo methods such as Gibbs sampling, and variational methods. An especially promising variational approach is based on exploiting tractable substructures in the Bayesian network.

Zoubin Ghahramani and Michael I. Jordan. Factorial hidden Markov models. Machine Learning, 29(2-3):245-273, 1997.

Abstract: Hidden Markov models (HMMs) have proven to be one of the most widely used tools for learning probabilistic models of time series data. In an HMM, information about the past is conveyed through a single discrete variable-the hidden state. We discuss a generalization of HMMs in which this state is factored into multiple state variables and is therefore represented in a distributed manner. We describe an exact algorithm for inferring the posterior probabilities of the hidden state variables given the observations, and relate it to the forward-backward algorithm for HMMs and to algorithms for more general graphical models. Due to the combinatorial nature of the hidden state representation, this exact algorithm is intractable. As in other intractable systems, approximate inference can be carried out using Gibbs sampling or variational methods. Within the variational framework, we present a structured approximation in which the the state variables are decoupled, yielding a tractable algorithm for learning the parameters of the model. Empirical comparisons suggest that these approximations are efficient and provide accurate alternatives to the exact methods. Finally, we use the structured approximation to model Bach's chorales and show that factorial HMMs can capture statistical structure in this data set which an unconstrained HMM cannot.

Geoffrey E Hinton and Zoubin Ghahramani. Generative models for discovering sparse distributed representations. Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences, 352(1358):1177-1190, 1997.

Abstract: We describe a hierarchical, generative model that can be viewed as a nonlinear generalization of factor analysis and can be implemented in a neural network. The model uses bottom–up, top–down and lateral connections to perform Bayesian perceptual inference correctly. Once perceptual inference has been performed the connection strengths can be updated using a very simple learning rule that only requires locally available information. We demonstrate that the network learns to extract sparse, distributed, hierarchical representations.

David A. Cohn, Zoubin Ghahramani, and Michael I. Jordan. Active learning with statistical models. J. Artif. Intell. Res. (JAIR), 4:129-145, 1996.

Abstract: For many types of machine learning algorithms, one can compute the statistically ``optimal'' way to select training data. In this paper, we review how optimal data selection techniques have been used with feedforward neural networks. We then show how the same principles may be used to select data for two alternative, statistically-based learning architectures: mixtures of Gaussians and locally weighted regression. While the techniques for neural networks are computationally expensive and approximate, the techniques for mixtures of Gaussians and locally weighted regression are both efficient and accurate. Empirically, we observe that the optimality criterion sharply decreases the number of training examples the learner needs in order to achieve good performance.

Michael I. Jordan, Zoubin Ghahramani, and Lawrence K. Saul. Hidden Markov decision trees. In Michael Mozer, Michael I. Jordan, and Thomas Petsche, editors, NIPS, pages 501-507. MIT Press, 1996.

Abstract: We study a time series model that can be viewed as a decision tree with Markov temporal structure. The model is intractable for exact calculations, thus we utilize variational approximations. We consider three different distributions for the approximation: one in which the Markov calculations are performed exactly and the layers of the decision tree are decoupled, one in which the decision tree calculations are performed exactly and the time steps of the Markov chain are decoupled, and one in which a Viterbi-like assumption is made to pick out a single most likely state sequence. We present simulation results for artificial data and the Bach chorales.

Carl Edward Rasmussen, Radford M. Neal, Geoffrey E. Hinton, Drew van Camp, Mike Revow, Zoubin Ghahramani, Rafal Kustra, and Robert Tibshirani. The DELVE manual, 1996.

Abstract: DELVE - Data for Evaluating Learning in Valid Experiments - is a collection of datasets from many sources, an environment within which this data can be used to assess the performance of methods for learning relationships from data, and a repository for the results of such experiments.

Comment: The delve website.

Zoubin Ghahramani and Michael I. Jordan. Factorial hidden Markov models. In David S. Touretzky, Michael Mozer, and Michael E. Hasselmo, editors, NIPS, pages 472-478. MIT Press, 1995.

Abstract: Hidden Markov models (HMMs) have proven to be one of the most widely used tools for learning probabilistic models of time series data. In an HMM, information about the past is conveyed through a single discrete variable-the hidden state. We discuss a generalization of HMMs in which this state is factored into multiple state variables and is therefore represented in a distributed manner. We describe an exact algorithm for inferring the posterior probabilities of the hidden state variables given the observations, and relate it to the forward-backward algorithm for HMMs and to algorithms for more general graphical models. Due to the combinatorial nature of the hidden state representation, this exact algorithm is intractable. As in other intractable systems, approximate inference can be carried out using Gibbs sampling or variational methods. Within the variational framework, we present a structured approximation in which the the state variables are decoupled, yielding a tractable algorithm for learning the parameters of the model. Empirical comparisons suggest that these approximations are efficient and provide accurate alternatives to the exact methods. Finally, we use the structured approximation to model Bach's chorales and show that factorial HMMs can capture statistical structure in this data set which an unconstrained HMM cannot.

David A. Cohn, Zoubin Ghahramani, and Michael I. Jordan. Active learning with statistical models. In Gerald Tesauro, David S. Touretzky, and Todd K. Leen, editors, NIPS, pages 705-712. MIT Press, 1994.

Abstract: For many types of machine learning algorithms, one can compute the statistically ``optimal'' way to select training data. In this paper, we review how optimal data selection techniques have been used with feedforward neural networks. We then show how the same principles may be used to select data for two alternative, statistically-based learning architectures: mixtures of Gaussians and locally weighted regression. While the techniques for neural networks are computationally expensive and approximate, the techniques for mixtures of Gaussians and locally weighted regression are both efficient and accurate. Empirically, we observe that the optimality criterion sharply decreases the number of training examples the learner needs in order to achieve good performance.

Zoubin Ghahramani. Factorial learning and the EM algorithm. In Gerald Tesauro, David S. Touretzky, and Todd K. Leen, editors, NIPS, pages 617-624. MIT Press, 1994.

Abstract: Many real world learning problems are best characterized by an interaction of multiple independent causes or factors. Discovering such causal structure from the data is the focus of this paper. Based on Zemel and Hinton's cooperative vector quantizer (CVQ) architecture, an unsupervised learning algorithm is derived from the Expectation-Maximization (EM) framework. Due to the combinatorial nature of the data generation process, the exact E-step is computationally intractable. Two alternative methods for computing the E-step are proposed: Gibbs sampling and mean-field approximation, and some promising empirical results are presented.

Zoubin Ghahramani, Daniel M. Wolpert, and Michael I. Jordan. Computational structure of coordinate transformations: A generalization study. In Gerald Tesauro, David S. Touretzky, and Todd K. Leen, editors, NIPS, pages 1125-1132. MIT Press, 1994.

Abstract: One of the fundamental properties that both neural networks and the central nervous system share is the ability to learn and generalize from examples. While this property has been studied extensively in the neural network literature it has not been thoroughly explored in human perceptual and motor learning. We have chosen a coordinate transformation system-the visuomotor map which transforms visual coordinates into motor coordinates-to study the generalization effects of learning new input-output pairs. Using a paradigm of computer controlled altered visual feedback, we have studied the generalization of the visuomotor map subsequent to both local and context-dependent remappings. A local remapping of one or two input-output pairs induced a significant global, yet decaying, change in the visuomotor map, suggesting a representation for the map composed of units with large functional receptive fields. Our study of context-dependent remappings indicated that a single point in visual space can be mapped to two different finger locations depending on a context variable-the starting point of the movement. Furthermore, as the context is varied there is a gradual shift between the two remappings, consistent with two visuomotor modules being learned and gated smoothly with the context.

Daniel M. Wolpert, Zoubin Ghahramani, and Michael I. Jordan. Forward dynamic models in human motor control: Psychophysical evidence. In Gerald Tesauro, David S. Touretzky, and Todd K. Leen, editors, NIPS, pages 43-50. MIT Press, 1994.

Abstract: Based on computational principles, with as yet no direct experimental validation, it has been proposed that the central nervous system (CNS) uses an internal model to simulate the dynamic behavior of the motor system in planning, control and learning. We present experimental results and simulations based on a novel approach that investigates the temporal propagation of errors in the sensorimotor integration process. Our results provide direct support for the existence of an internal model.

Zoubin Ghahramani and Michael I. Jordan. Supervised learning from incomplete data via an EM approach. In Jack D. Cowan, Gerald Tesauro, and Joshua Alspector, editors, NIPS, pages 120-127. Morgan Kaufmann, 1993.

Abstract: Real-world learning tasks may involve high-dimensional data sets with arbitrary patterns of missing data. In this paper we present a framework based on maximum likelihood density estimation for learning from such data sets. We use mixture models for the density estimates and make two distinct appeals to the ExpectationMaximization (EM) principle (Dempster et al., 1977) in deriving a learning algorithm-EM is used both for the estimation of mixture components and for coping with missing data. The resulting algorithm is applicable to a wide range of supervised as well as unsupervised learning problems. Results from a classification benchmark-the iris data set-are presented.


Creighton Heaukulani

Creighton Heaukulani, David A. Knowles, and Zoubin Ghahramani. Beta diffusion trees and hierarchical feature allocations. Technical report, Dept. of Engineering, University of Cambridge, August 2014.

Abstract: We define the beta diffusion tree, a random tree structure with a set of leaves that defines a collection of overlapping subsets of objects, known as a feature allocation. A generative process for the tree structure is defined in terms of particles (representing the objects) diffusing in some continuous space, analogously to the Dirichlet diffusion tree (Neal, 2003b), which defines a tree structure over partitions (i.e., non-overlapping subsets) of the objects. Unlike in the Dirichlet diffusion tree, multiple copies of a particle may exist and diffuse along multiple branches in the beta diffusion tree, and an object may therefore belong to multiple subsets of particles. We demonstrate how to build a hierarchically-clustered factor analysis model with the beta diffusion tree and how to perform inference over the random tree structures with a Markov chain Monte Carlo algorithm. We conclude with several numerical experiments on missing data problems with data sets of gene expression microarrays, international development statistics, and intranational socioeconomic measurements.

Creighton Heaukulani, David A. Knowles, and Zoubin Ghahramani. Beta diffusion trees. In 31st International Conference on Machine Learning, Beijing, China, June 2014.

Abstract: We define the beta diffusion tree, a random tree structure with a set of leaves that defines a collection of overlapping subsets of objects, known as a feature allocation. The generative process for the tree is defined in terms of particles (representing the objects) diffusing in some continuous space, analogously to the Dirichlet and Pitman–Yor diffusion trees (Neal, 2003b; Knowles & Ghahramani, 2011), both of which define tree structures over clusters of the particles. With the beta diffusion tree, however, multiple copies of a particle may exist and diffuse to multiple locations in the continuous space, resulting in (a random number of) possibly overlapping clusters of the objects. We demonstrate how to build a hierarchically-clustered factor analysis model with the beta diffusion tree and how to perform inference over the random tree structures with a Markov chain Monte Carlo algorithm. We conclude with several numerical experiments on missing data problems with data sets of gene expression arrays, international development statistics, and intranational socioeconomic measurements.

Creighton Heaukulani and Daniel M. Roy. The combinatorial structure of beta negative binomial processes. Technical report, Dept. of Engineering, University of Cambridge, March 2014.

Abstract: We characterize the combinatorial structure of conditionally-i.i.d. sequences of negative binomial processes with a common beta process base measure. In Bayesian nonparametric applications, such processes have served as models for unknown multisets of a measurable space. Previous work has characterized random subsets arising from conditionally-i.i.d. sequences of Bernoulli processes with a common beta process base measure. In this case, the combinatorial structure is described by the Indian buffet process. Our results give a count analogue of the Indian buffet process, which we call a negative binomial Indian buffet process. As an intermediate step toward this goal, we provide constructions for the beta negative binomial process that avoid a representation of the underlying beta process base measure.

Creighton Heaukulani and Zoubin Ghahramani. Dynamic probabilistic models for latent feature propagation in social networks. In 30th International Conference on Machine Learning, Atlanta, Georgia, USA, June 2013.

Abstract: Current Bayesian models for dynamic social network data have focused on modelling the influence of evolving unobserved structure on observed social interactions. However, an understanding of how observed social relationships from the past affect future unobserved structure in the network has been neglected. In this paper, we introduce a new probabilistic model for capturing this phenomenon, which we call latent feature propagation, in social networks. We demonstrate our model's capability for inferring such latent structure in varying types of social network datasets, and experimental studies show this structure achieves higher predictive performance on link prediction and forecasting tasks.


Katherine Heller

Shakir Mohamed, Katherine A. Heller, and Zoubin Ghahramani. Bayesian and L1 approaches for sparse unsupervised learning. In 29th International Conference on Machine Learning, 2012.

Abstract: The use of L1 regularisation for sparse learning has generated immense research interest, with many successful applications in diverse areas such as signal acquisition, image coding, genomics and collaborative filtering. While existing work highlights the many advantages of L1 methods, in this paper we find that L1 regularisation often dramatically under-performs in terms of predictive performance when compared to other methods for inferring sparsity. We focus on unsupervised latent variable models, and develop L1 minimising factor models, Bayesian variants of "L1", and Bayesian models with a stronger L0-like sparsity induced through spike-and-slab distributions. These spike-and-slab Bayesian factor models encourage sparsity while accounting for uncertainty in a principled manner, and avoid unnecessary shrinkage of non-zero values. We demonstrate on a number of data sets that in practice spike-and-slab Bayesian methods out-perform L1 minimisation, even on a com- putational budget. We thus highlight the need to re-assess the wide use of L1 methods in sparsity-reliant applications, particularly when we care about generalising to previously unseen data, and provide an alternative that, over many varying conditions, provides improved generalisation performance.

Joshua Abbott, Katherine A. Heller, Zoubin Ghahramani, and Thomas L. Griffiths. Testing a Bayesian measure of representativeness using a large image database. In Advances in Neural Information Processing Systems 24, Cambridge, MA, USA, 2011. The MIT Press.

Abstract: How do people determine which elements of a set are most representative of that set? We extend an existing Bayesian measure of representativeness, which indicates the representativeness of a sample from a distribution, to define a measure of the representativeness of an item to a set. We show that this measure is formally related to a machine learning method known as Bayesian Sets. Building on this connection, we derive an analytic expression for the representativeness of objects described by a sparse vector of binary features. We then apply this measure to a large database of images, using it to determine which images are the most representative members of different sets. Comparing the resulting predictions to human judgments of representativeness provides a test of this measure with naturalistic stimuli, and illustrates how databases that are more commonly used in computer vision and machine learning can be used to evaluate psychological theories.

Sinead Williamson, Katherine A. Heller, C. Wang, and D. M. Blei. The IBP compound Dirichlet process and its application to focused topic modeling. In 27th International Conference on Machine Learning, pages 1151-1158, Haifa, Israel, June 2010.

Abstract: The hierarchical Dirichlet process (HDP) is a Bayesian nonparametric mixed membership model - each data point is modeled with a collection of components of different proportions. Though powerful, the HDP makes an assumption that the probability of a component being exhibited by a data point is positively correlated with its proportion within that data point. This might be an undesirable assumption. For example, in topic modeling, a topic (component) might be rare throughout the corpus but dominant within those documents (data points) where it occurs. We develop the IBP compound Dirichlet process (ICD), a Bayesian nonparametric prior that decouples across-data prevalence and within-data proportion in a mixed membership model. The ICD combines properties from the HDP and the Indian buffet process (IBP), a Bayesian nonparametric prior on binary matrices. The ICD assigns a subset of the shared mixture components to each data point. This subset, the data point's "focus", is determined independently from the amount that each of its components contribute. We develop an ICD mixture model for text, the focused topic model (FTM), and show superior performance over the HDP-based topic model.

R. Silva, K. A. Heller, Z. Ghahramani, and E. M. Airoldi. Ranking relations using analogies in biological and information networks. Annals of Applied Statistics, 4(2):615-644, 2010.

Abstract: Analogical reasoning depends fundamentally on the ability to learn and generalize about relations between objects. We develop an approach to relational learning which, given a set of pairs of objects S = A(1):B(1), A(2):B(2), ..., A(N):B(N), measures how well other pairs A:B fit in with the set S. Our work addresses the question: is the relation between objects A and B analogous to those relations found in S? Such questions are particularly relevant in information retrieval, where an investigator might want to search for analogous pairs of objects that match the query set of interest. There are many ways in which objects can be related, making the task of measuring analogies very challenging. Our approach combines a similarity measure on function spaces with Bayesian analysis to produce a ranking. It requires data containing features of the objects of interest and a link matrix specifying which relationships exist; no further attributes of such relationships are necessary. We illustrate the potential of our method on text analysis and information networks. An application on discovering functional interactions between pairs of proteins is discussed in detail, where we show that our approach can work in practice even if a small set of protein pairs is provided.

Shakir Mohamed, Katherine A. Heller, and Zoubin Ghahramani. Bayesian exponential family PCA. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 1089-1096, Cambridge, MA, USA, December 2009. The MIT Press.

Abstract: Principal Components Analysis (PCA) has become established as one of the key tools for dimensionality reduction when dealing with real valued data. Approaches such as exponential family PCA and non-negative matrix factorisation have successfully extended PCA to non-Gaussian data types, but these techniques fail to take advantage of Bayesian inference and can suffer from problems of overfitting and poor generalisation. This paper presents a fully probabilistic approach to PCA, which is generalised to the exponential family, based on Hybrid Monte Carlo sampling. We describe the model which is based on a factorisation of the observed data matrix, and show performance of the model on both synthetic and real data.

Comment: spotlight.

R. Savage, K. A. Heller, Y. Xu, Zoubin Ghahramani, W. Truman, M. Grant, K. Denby, and D. L. Wild. R/BHC: fast Bayesian hierarchical clustering for microarray data. BMC Bioinformatics 2009, 10(242):1-9, August 2009, doi 10.1186/1471-2105-10-242.

Abstract: Background: Although the use of clustering methods has rapidly become one of the standard computational approaches in the literature of microarray gene expression data analysis, little attention has been paid to uncertainty in the results obtained.
Results: We present an R/Bioconductor port of a fast novel algorithm for Bayesian agglomerative hierarchical clustering and demonstrate its use in clustering gene expression microarray data. The method performs bottom-up hierarchical clustering, using a Dirichlet Process (infinite mixture) to model uncertainty in the data and Bayesian model selection to decide at each step which clusters to merge.
Conclusion: Biologically plausible results are presented from a well studied data set: expression profiles of A. thaliana subjected to a variety of biotic and abiotic stresses. Our method avoids several limitations of traditional methods, for example how many clusters there should be and how to choose a principled distance metric.

Yang Xu, Katherine A. Heller, and Zoubin Ghahramani. Tree-based inference for Dirichlet process mixtures. In D. van Dyk and M. Welling, editors, 12th International Conference on Artificial Intelligence and Statistics, volume 5, pages 623-630, Clearwater Beach, FL, USA, April 2009. Microtome Publishing (paper), Journal of Machine Learning Research (online). ISSN 1938-7228.

Abstract: The Dirichlet process mixture (DPM) is a widely used model for clustering and for general nonparametric Bayesian density estimation. Unfortunately, like in many statistical models, exact inference in a DPM is intractable, and approximate methods are needed to perform efficient inference. While most attention in the literature has been placed on Markov chain Monte Carlo (MCMC) [1, 2, 3], variational Bayesian (VB) [4] and collapsed variational methods [5], [6] recently introduced a novel class of approximation for DPMs based on Bayesian hierarchical clustering (BHC). These tree-based combinatorial approximations efficiently sum over exponentially many ways of partitioning the data and offer a novel lower bound on the marginal likelihood of the DPM [6]. In this paper we make the following contributions: (1) We show empirically that the BHC lower bounds are substantially tighter than the bounds given by VB [4] and by collapsed variational methods [5] on synthetic and real datasets. (2) We also show that BHC offers a more accurate predictive performance on these datasets. (3) We further improve the tree-based lower bounds with an algorithm that efficiently sums contributions from alternative trees. (4) We present a fast approximate method for BHC. Our results suggest that our combinatorial approximate inference methods and lower bounds may be useful not only in DPMs but in other models as well.

Katherine A. Heller, Sinead Williamson, and Zoubin Ghahramani. Statistical models for partial membership. In Andrew McCallum and Sam Roweis, editors, 25th International Conference on Machine Learning, pages 392-399, Helsinki, Finland, July 2008. Omnipress.

Abstract: We present a principled Bayesian framework for modeling partial memberships of data points to clusters. Unlike a standard mixture model which assumes that each data point belongs to one and only one mixture component, or cluster, a partial membership model allows data points to have fractional membership in multiple clusters. Algorithms which assign data points partial memberships to clusters can be useful for tasks such as clustering genes based on microarray data (Gasch & Eisen, 2002). Our Bayesian Partial Membership Model (BPM) uses exponential family distributions to model each cluster, and a product of these distibtutions, with weighted parameters, to model each datapoint. Here the weights correspond to the degree to which the datapoint belongs to each cluster. All parameters in the BPM are continuous, so we can use Hybrid Monte Carlo to perform inference and learning. We discuss relationships between the BPM and Latent Dirichlet Allocation, Mixed Membership models, Exponential Family PCA, and fuzzy clustering. Lastly, we show some experimental results and discuss nonparametric extensions to our model.

Katherine A. Heller and Zoubin Ghahramani. A nonparametric Bayesian approach to modeling overlapping clusters. In Marina Meila and Xiaotong Shen, editors, AISTATS, volume 2 of JMLR Proceedings, pages 187-194. JMLR.org, 2007.

Abstract: Although clustering data into mutually exclusive partitions has been an extremely successful approach to unsupervised learning, there are many situations in which a richer model is needed to fully represent the data. This is the case in problems where data points actually simultaneously belong to multiple, overlapping clusters. For example a particular gene may have several functions, therefore belonging to several distinct clusters of genes, and a biologist may want to discover these through unsupervised modeling of gene expression data. We present a new nonparametric Bayesian method, the Infinite Overlapping Mixture Model (IOMM), for modeling overlapping clusters. The IOMM uses exponential family distributions to model each cluster and forms an overlapping mixture by taking products of such distributions, much like products of experts (Hinton, 2002). The IOMM allows an unbounded number of clusters, and assignments of points to (multiple) clusters is modeled using an Indian Buffet Process (IBP), (Griffiths and Ghahramani, 2006). The IOMM has the desirable properties of being able to focus in on overlapping regions while maintaining the ability to model a potentially infinite number of clusters which may overlap. We derive MCMC inference algorithms for the IOMM and show that these can be used to cluster movies into multiple genres.

Ricardo Silva, Katherine A. Heller, and Zoubin Ghahramani. Analogical reasoning with relational Bayesian sets. In Marina Meila and Xiaotong Shen, editors, AISTATS, volume 2 of JMLR Proceedings, pages 500-507. JMLR.org, 2007.

Abstract: Analogical reasoning depends fundamentally on the ability to learn and generalize about relations between objects. There are many ways in which objects can be related, making automated analogical reasoning very chal- lenging. Here we develop an approach which, given a set of pairs of related objects S = A1:B1,A2:B2,...,AN:BN, measures how well other pairs A:B fit in with the set S. This addresses the question: is the relation between objects A and B analogous to those relations found in S? We recast this classi- cal problem as a problem of Bayesian analy- sis of relational data. This problem is non- trivial because direct similarity between ob- jects is not a good way of measuring analo- gies. For instance, the analogy between an electron around the nucleus of an atom and a planet around the Sun is hardly justified by isolated, non-relational, comparisons of an electron to a planet, and a nucleus to the Sun. We develop a generative model for predicting the existence of relationships and extend the framework of Ghahramani and Heller (2005) to provide a Bayesian measure for how analogous a relation is to other relations. This sheds new light on an old problem, which we motivate and illustrate through practical applications in exploratory data analysis.

Zoubin Ghahramani and Katherine A. Heller. Bayesian sets. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, pages 435-442, Cambridge, MA, USA, December 2006. The MIT Press.

Abstract: Inspired by "Google™ Sets", we consider the problem of retrieving items from a concept or cluster, given a query consisting of a few items from that cluster. We formulate this as a Bayesian inference problem and describe a very simple algorithm for solving it. Our algorithm uses a model-based concept of a cluster and ranks items using a score which evaluates the marginal probability that each item belongs to a cluster containing the query items. For exponential family models with conjugate priors this marginal probability is a simple function of sufficient statistics. We focus on sparse binary data and show that our score can be evaluated exactly using a single sparse matrix multiplication, making it possible to apply our algorithm to very large datasets. We evaluate our algorithm on three datasets: retrieving movies from EachMovie, finding completions of author sets from the NIPS dataset, and finding completions of sets of words appearing in the Grolier encyclopedia. We compare to Google™ Sets and show that Bayesian Sets gives very reasonable set completions.

Katherine A. Heller and Zoubin Ghahramani. A simple Bayesian framework for content-based image retrieval. In CVPR, pages 2110-2117. IEEE Computer Society, 2006.

Abstract: We present a Bayesian framework for content-based image retrieval which models the distribution of color and texture features within sets of related images. Given a userspecified text query (e.g. "penguins") the system first extracts a set of images, from a labelled corpus, corresponding to that query. The distribution over features of these images is used to compute a Bayesian score for each image in a large unlabelled corpus. Unlabelled images are then ranked using this score and the top images are returned. Although the Bayesian score is based on computing marginal likelihoods, which integrate over model parameters, in the case of sparse binary data the score reduces to a single matrix-vector multiplication and is therefore extremely efficient to compute. We show that our method works surprisingly well despite its simplicity and the fact that no relevance feedback is used. We compare different choices of features, and evaluate our results using human subjects.

Katherine A. Heller and Zoubin Ghahramani. Bayesian hierarchical clustering. In Luc De Raedt and Stefan Wrobel, editors, ICML, volume 119 of ACM International Conference Proceeding Series, pages 297-304. ACM, 2005.

Abstract: We present a novel algorithm for agglomerative hierarchical clustering based on evaluating marginal likelihoods of a probabilistic model. This algorithm has several advantages over traditional distance-based agglomerative clustering algorithms. (1) It defines a probabilistic model of the data which can be used to compute the predictive distribution of a test point and the probability of it belonging to any of the existing clusters in the tree. (2) It uses a model-based criterion to decide on merging clusters rather than an ad-hoc distance metric. (3) Bayesian hypothesis testing is used to decide which merges are advantageous and to output the recommended depth of the tree. (4) The algorithm can be interpreted as a novel fast bottom-up approximate inference method for a Dirichlet process (i.e. countably infinite) mixture model (DPM). It provides a new lower bound on the marginal likelihood of a DPM by summing over exponentially many clusterings of the data in polynomial time. We describe procedures for learning the model hyperpa-rameters, computing the predictive distribution, and extensions to the algorithm. Experimental results on synthetic and real-world data sets demonstrate useful properties of the algorithm.


José Miguel Hernández-Lobato

Neil Houlsby, José Miguel Hernández-Lobato, and Zoubin Ghahramani. Cold-start active learning with robust ordinal matrix factorization. In 31st International Conference on Machine Learning, Beijing, China, June 2014.

Abstract: We present a new matrix factorization model for rating data and a corresponding active learning strategy to address the cold-start problem. Cold-start is one of the most challenging tasks for recommender systems: what to recommend with new users or items for which one has little or no data. An approach is to use active learning to collect the most useful initial ratings. However, the performance of active learning depends strongly upon having accurate estimates of i) the uncertainty in model parameters and ii) the intrinsic noisiness of the data. To achieve these estimates we propose a heteroskedastic Bayesian model for ordinal matrix factorization. We also present a computationally efficient framework for Bayesian active learning with this type of complex probabilistic model. This algorithm successfully distinguishes between informative and noisy data points. Our model yields state-of-the-art predictive performance and, coupled with our active learning strategy, enables us to gain useful information in the cold-start setting from the very first active sample.

José Miguel Hernández-Lobato, Neil Houlsby, and Zoubin Ghahramani. Probabilistic matrix factorization with non-random missing data. In 31st International Conference on Machine Learning, Beijing, China, June 2014.

Abstract: We propose a probabilistic matrix factorization model for collaborative filtering that learns from data that is missing not at random (MNAR). Matrix factorization models exhibit state-of-the-art predictive performance in collaborative filtering. However, these models usually assume that the data is missing at random (MAR), and this is rarely the case. For example, the data is not MAR if users rate items they like more than ones they dislike. When the MAR assumption is incorrect, inferences are biased and predictive performance can suffer. Therefore, we model both the generative process for the data and the missing data mechanism. By learning these two models jointly we obtain improved performance over state-of-the-art methods when predicting the ratings and when modeling the data observation process. We present the first viable MF model for MNAR data. Our results are promising and we expect that further research on NMAR models will yield large gains in collaborative filtering.

José Miguel Hernández-Lobato, Neil Houlsby, and Zoubin Ghahramani. Stochastic inference for scalable probabilistic modeling of binary matrices. In 31st International Conference on Machine Learning, Beijing, China, June 2014.

Abstract: Fully observed large binary matrices appear in a wide variety of contexts. To model them, probabilistic matrix factorization (PMF) methods are an attractive solution. However, current batch algorithms for PMF can be inefficient because they need to analyze the entire data matrix before producing any parameter updates. We derive an efficient stochastic inference algorithm for PMF models of fully observed binary matrices. Our method exhibits faster convergence rates than more expensive batch approaches and has better predictive performance than scalable alternatives. The proposed method includes new data subsampling strategies which produce large gains over standard uniform subsampling. We also address the task of automatically selecting the size of the minibatches of data used by our method. For this, we derive an algorithm that adjusts this hyper-parameter online.

Daniel Hernández-Lobato and José Miguel Hernández-Lobato. Learning feature selection dependencies in multi-task learning. In Advances in Neural Information Processing Systems 27, pages 1-9, Lake Tahoe, California, USA, December 2013.

Abstract: A probabilistic model based on the horseshoe prior is proposed for learning de- pendencies in the process of identifying relevant features for prediction. Exact inference is intractable in this model. However, expectation propagation offers an approximate alternative. Because the process of estimating feature selection dependencies may suffer from over-fitting in the model proposed, additional data from a multi-task learning scenario are considered for induction. The same model can be used in this setting with few modifications. Furthermore, the assumptions made are less restrictive than in other multi-task methods: The different tasks must share feature selection dependencies, but can have different relevant features and model coefficients. Experiments with real and synthetic data show that this model performs better than other multi-task alternatives from the literature. The experiments also show that the model is able to induce suitable feature selection dependencies for the problems considered, only from the training data.

José Miguel Hernández-Lobato, James Robert Lloyds, and Daniel Hernández-Lobato. Gaussian process conditional copulas with applications to financial time series. In Advances in Neural Information Processing Systems 27, pages 1-9, Lake Tahoe, California, USA, December 2013.

Abstract: The estimation of dependencies between multiple variables is a central problem in the analysis of financial time series. A common approach is to express these dependencies in terms of a copula function. Typically the copula function is assumed to be constant but this may be inaccurate when there are covariates that could have a large influence on the dependence structure of the data. To account for this, a Bayesian framework for the estimation of conditional copulas is proposed. In this framework the parameters of a copula are non-linearly related to some arbitrary conditioning variables. We evaluate the ability of our method to predict time-varying dependencies on several equities and currencies and observe consistent performance gains compared to static copula models and other time-varying copula methods.

Daniel Hernández-Lobato, José Miguel Hernández-Lobato, and Pierre Dupont. Generalized spike-and-slab priors for Bayesian group feature selection using expectation propagation. Journal of Machine Learning Research, 14:1891-1945, July 2013.

Abstract: We describe a Bayesian method for group feature selection in linear regression problems. The method is based on a generalized version of the standard spike-and-slab prior distribution which is often used for individual feature selection. Exact Bayesian inference under the prior considered is infeasible for typical regression problems. However, approximate inference can be carried out efficiently using Expectation Propagation (EP). A detailed analysis of the generalized spike-and-slab prior shows that it is well suited for regression problems that are sparse at the group level. Furthermore, this prior can be used to introduce prior knowledge about specific groups of features that are a priori believed to be more relevant. An experimental evaluation compares the performance of the proposed method with those of group LASSO, Bayesian group LASSO, automatic relevance determination and additional variants used for group feature selection. The results of these experiments show that a model based on the generalized spike-and-slab prior and the EP algorithm has state-of-the-art prediction performance in the problems analyzed. Furthermore, this model is also very useful to carry out sequential experimental design (also known as active learning), where the data instances that are most informative are iteratively included in the training set, reducing the number of instances needed to obtain a particular level of prediction accuracy.

David Lopez-Paz, José Miguel Hernández-Lobato, and Zoubin Ghahramani. Gaussian process vine copulas for multivariate dependence. In 30th International Conference on Machine Learning, Atlanta, Georgia, USA, June 2013.

Abstract: Copulas allow to learn marginal distributions separately from the multivariate dependence structure (copula) that links them together into a density function. Vine factorizations ease the learning of high-dimensional copulas by constructing a hierarchy of conditional bivariate copulas. However, to simplify inference, it is common to assume that each of these conditional bivariate copulas is independent from its conditioning variables. In this paper, we relax this assumption by discovering the latent functions that specify the shape of a conditional copula given its conditioning variables We learn these functions by following a Bayesian approach based on sparse Gaussian processes with expectation propagation for scalable, approximate inference. Experiments on real-world datasets show that, when modeling all conditional dependencies, we obtain better estimates of the underlying copula of the data.

Yue Wu, José Miguel Hernández-Lobato, and Zoubin Ghahramani. Dynamic covariance models for multivariate financial time series. In 30th International Conference on Machine Learning, Atlanta, Georgia, USA, June 2013.

Abstract: The accurate prediction of time-changing covariances is an important problem in the modeling of multivariate financial data. However, some of the most popular models suffer from a) overfitting problems and multiple local optima, b) failure to capture shifts in market conditions and c) large computational costs. To address these problems we introduce a novel dynamic model for time-changing covariances. Over-fitting and local optima are avoided by following a Bayesian approach instead of computing point estimates. Changes in market conditions are captured by assuming a diffusion process in parameter values, and finally computationally efficient and scalable inference is performed using particle filters. Experiments with financial data show excellent performance of the proposed method with respect to current standard models.

José Miguel Hernández-Lobato, Neil Houlsby, and Zoubin Ghahramani. Stochastic inference for scalable probabilistic modeling of binary matrices. In NIPS Workshop on Randomized Methods for Machine Learning, 2013.

Abstract: Fully observed large binary matrices appear in a wide variety of contexts. To model them, probabilistic matrix factorization (PMF) methods are an attractive solution. However, current batch algorithms for PMF can be inefficient since they need to analyze the entire data matrix before producing any parameter updates. We derive an efficient stochastic inference algorithm for PMF models of fully observed binary matrices. Our method exhibits faster convergence rates than more expensive batch approaches and has better predictive performance than scalable alternatives. The proposed method includes new data subsampling strategies which produce large gains over standard uniform subsampling. We also address the task of automatically selecting the size of the minibatches of data and we propose an algorithm that adjusts this hyper-parameter in an online manner.

Michael Kaschesky, Pawel Sobkowicz, José Miguel Hernández-Lobato, Guillaume Bouchard, Cedric Archambeau, Nicolas Scharioth, Robert Manchin, Adrian Gschwend, and Reinhard Riedl. Bringing representativeness into social media monitoring and analysis. In 46th Hawaii International Conference on System Sciences, Manoa, Hawaii, 2013.

Abstract: The opinions, expectations and behavior of citizens are increasingly reflected online - therefore, mining the internet for such data can enhance decision-making in public policy, communications, marketing, finance and other fields. However, to come closer to the representativeness of classic opinion surveys there is a lack of knowledge about the sociodemographic characteristics of those voicing opinions on the internet. This paper proposes to calibrate online opinions aggregated from multiple and heterogeneous data sources with traditional surveys enhanced with rich socio-demographic information to enable insights into which opinions are expressed on the internet by specific segments of society. The goal of this research is to provide professionals in citizen- and consumer- centered domains with more concise near real-time intelligence on online opinions. To become effective, the methodologies presented in this paper must be integrated into a coherent decision support system.

David Lopez-Paz, José Miguel Hernández-Lobato, and Bernhard Scholköpf. Semi-supervised domain adaptation with non-parametric copulas. In Advances in Neural Information Processing Systems 26, pages 1-9, Lake Tahoe, California, USA, December 2012.

Abstract: A new framework based on the theory of copulas is proposed to address semisupervised domain adaptation problems. The presented method factorizes any multivariate density into a product of marginal distributions and bivariate copula functions. Therefore, changes in each of these factors can be detected and corrected to adapt a density model accross different learning domains. Importantly, we introduce a novel vine copula model, which allows for this factorization in a non-parametric manner. Experimental results on regression problems with real-world data illustrate the efficacy of the proposed approach when compared to state-of-the-art techniques.

Neil Houlsby, Jose Miguel Hernández-Lobato, Ferenc Huszár, and Zoubin Ghahramani. Collaborative Gaussian processes for preference learning. In Advances in Neural Information Processing Systems 26, pages 2096-2104. Curran Associates, Inc., 2012.

Abstract: We present a new model based on Gaussian processes (GPs) for learning pairwise preferences expressed by multiple users. Inference is simplified by using a preference kernel for GPs which allows us to combine supervised GP learning of user preferences with unsupervised dimensionality reduction for multi-user systems. The model not only exploits collaborative information from the shared structure in user behavior, but may also incorporate user features if they are available. Approximate inference is implemented using a combination of expectation propagation and variational Bayes. Finally, we present an efficient active learning strategy for querying preferences. The proposed technique performs favorably on real-world data against state-of-the-art multi-user preference learning algorithms.

Daniel Hernández-Lobato, José Miguel Hernández-Lobato, and Pierre Dupont. Robust multi-class Gaussian process classification. In Advances in Neural Information Processing Systems 25, 2011.

Abstract: Multi-class Gaussian Processs Classifiers (MGPCs) are often affected by overfitting problems when labeling errors occur far from the decision boundaries. To prevent this, we investigate a robust MGPC (RMGPC) which considers labeling errors independently of their distance to the decision boundaries. Expectation propagation is used for approximate inference. Experiments with several datasets in which noise is injected in the labels illustrate the benefits of RMGPC. This method performs better than other Gaussian process alternatives based on considering latent Gaussian noise or heavy-tailed processes. When no noise is injected in the labels, RMGPC still performs equal or better than the other methods. Finally, we show how RMGPC can be used for successfully indentifying data instances which are difficult to classify correctly in practice.


Matthew Hoffman

Matthew W Hoffman, Bobak Shahriari, and Nando de Freitas. On correlation and budget constraints in model-based bandit optimization with application to automatic machine learning. In 17th International Conference on Artificial Intelligence and Statistics, pages 365-374, Reykjavik, Iceland, April 2014.

Abstract: We address the problem of finding the maximizer of a nonlinear function that can only be evaluated, subject to noise, at a finite number of query locations. Further, we will assume that there is a constraint on the total number of permitted function evaluations. We introduce a Bayesian approach for this problem and show that it empirically outperforms both the existing frequentist counterpart and other Bayesian optimization methods. The Bayesian approach places emphasis on detailed modelling, including the modelling of correlations among the arms. As a result, it can perform well in situations where the number of arms is much larger than the number of allowed function evaluation, whereas the frequentist counterpart is inapplicable. This feature enables us to develop and deploy practical applications, such as automatic machine learning toolboxes. The paper presents comprehensive comparisons of the proposed approach with many Bayesian and bandit optimization techniques, the first comparison of many of these methods in the literature.


Neil Houlsby

Neil Houlsby, José Miguel Hernández-Lobato, and Zoubin Ghahramani. Cold-start active learning with robust ordinal matrix factorization. In 31st International Conference on Machine Learning, Beijing, China, June 2014.

Abstract: We present a new matrix factorization model for rating data and a corresponding active learning strategy to address the cold-start problem. Cold-start is one of the most challenging tasks for recommender systems: what to recommend with new users or items for which one has little or no data. An approach is to use active learning to collect the most useful initial ratings. However, the performance of active learning depends strongly upon having accurate estimates of i) the uncertainty in model parameters and ii) the intrinsic noisiness of the data. To achieve these estimates we propose a heteroskedastic Bayesian model for ordinal matrix factorization. We also present a computationally efficient framework for Bayesian active learning with this type of complex probabilistic model. This algorithm successfully distinguishes between informative and noisy data points. Our model yields state-of-the-art predictive performance and, coupled with our active learning strategy, enables us to gain useful information in the cold-start setting from the very first active sample.

José Miguel Hernández-Lobato, Neil Houlsby, and Zoubin Ghahramani. Probabilistic matrix factorization with non-random missing data. In 31st International Conference on Machine Learning, Beijing, China, June 2014.

Abstract: We propose a probabilistic matrix factorization model for collaborative filtering that learns from data that is missing not at random (MNAR). Matrix factorization models exhibit state-of-the-art predictive performance in collaborative filtering. However, these models usually assume that the data is missing at random (MAR), and this is rarely the case. For example, the data is not MAR if users rate items they like more than ones they dislike. When the MAR assumption is incorrect, inferences are biased and predictive performance can suffer. Therefore, we model both the generative process for the data and the missing data mechanism. By learning these two models jointly we obtain improved performance over state-of-the-art methods when predicting the ratings and when modeling the data observation process. We present the first viable MF model for MNAR data. Our results are promising and we expect that further research on NMAR models will yield large gains in collaborative filtering.

José Miguel Hernández-Lobato, Neil Houlsby, and Zoubin Ghahramani. Stochastic inference for scalable probabilistic modeling of binary matrices. In 31st International Conference on Machine Learning, Beijing, China, June 2014.

Abstract: Fully observed large binary matrices appear in a wide variety of contexts. To model them, probabilistic matrix factorization (PMF) methods are an attractive solution. However, current batch algorithms for PMF can be inefficient because they need to analyze the entire data matrix before producing any parameter updates. We derive an efficient stochastic inference algorithm for PMF models of fully observed binary matrices. Our method exhibits faster convergence rates than more expensive batch approaches and has better predictive performance than scalable alternatives. The proposed method includes new data subsampling strategies which produce large gains over standard uniform subsampling. We also address the task of automatically selecting the size of the minibatches of data used by our method. For this, we derive an algorithm that adjusts this hyper-parameter online.

Neil Houlsby and Massimiliano Ciaramita. A scalable Gibbs sampler for probabilistic entity linking. In 36th European Conference on Information Retrieval, pages 335-346. Springer, 2014.

Abstract: Entity linking involves labeling phrases in text with their referent entities, such as Wikipedia or Freebase entries. This task is challenging due to the large number of possible entities, in the millions, and heavy-tailed mention ambiguity. We formulate the problem in terms of probabilistic inference within a topic model, where each topic is associated with a Wikipedia article. To deal with the large number of topics we propose a novel efficient Gibbs sampling scheme which can also incorporate side information, such as the Wikipedia graph. This conceptually simple probabilistic approach achieves state-of-the-art performance in entity-linking on the Aida-CoNLL dataset.

Neil Houlsby and Guy Houlsby. Statistical fitting of undrained strength data. Geotechnique, 63(13):1253-1263, 2013, doi 10.1680/geot.13.P.007.

Abstract: We describe an approach, based on Bayesian statistical methods, that allows the fitting of a design profile to a set of measurements of undrained strengths. In particular we allow for the automatic determination of not only the positions of boundaries between geological units, but also the selection of the number of units to model the data in an appropriate way.

Neil Houlsby, Ferenc Huszár, Mohammad M Ghassemi, Gergő Orbán, Daniel M Wolpert, and Máté Lengyel. Cognitive tomography reveals complex task-independent mental representations. Current Biology, 23(21):2169-2175, 2013, doi 10.1016/j.cub.2013.09.012.

Abstract: Humans develop rich mental representations that guide their behavior in a variety of every-day tasks. However, it is unknown whether these representations, often formalized as priors in Bayesian inference, are specific for each task or subserve multiple tasks. Current approaches cannot distinguish between these two possibilities because they cannot extract comparable representations across different tasks. Here, we develop a novel method, termed cognitive tomography, that can extract complex, multi-dimensional priors across tasks. We apply this method to human judgments in two qualitatively different tasks, familiarity and odd-one-out, involving an ecologically relevant set of stimuli, human faces. We show that priors over faces are structurally complex and vary dramatically across subjects, but are invariant across the tasks within each subject. The priors we extract from each task allow us to predict with high precision the behavior of subjects for novel stimuli both in the same task as well as in the other task. Our results provide the first evidence for a single high-dimensional structured representation of a naturalistic stimulus set that guides behavior in multiple tasks. Moreover, the representations estimated by cognitive tomography can provide independent, behavior-based regressors for elucidating the neural correlates of complex naturalistic priors.

José Miguel Hernández-Lobato, Neil Houlsby, and Zoubin Ghahramani. Stochastic inference for scalable probabilistic modeling of binary matrices. In NIPS Workshop on Randomized Methods for Machine Learning, 2013.

Abstract: Fully observed large binary matrices appear in a wide variety of contexts. To model them, probabilistic matrix factorization (PMF) methods are an attractive solution. However, current batch algorithms for PMF can be inefficient since they need to analyze the entire data matrix before producing any parameter updates. We derive an efficient stochastic inference algorithm for PMF models of fully observed binary matrices. Our method exhibits faster convergence rates than more expensive batch approaches and has better predictive performance than scalable alternatives. The proposed method includes new data subsampling strategies which produce large gains over standard uniform subsampling. We also address the task of automatically selecting the size of the minibatches of data and we propose an algorithm that adjusts this hyper-parameter in an online manner.

Tomoharu Iwata, Neil Houlsby, and Zoubin Ghahramani. Active learning for interactive visualization. In 16th International Conference on Artificial Intelligence and Statistics, 2013.

Abstract: Many automatic visualization methods have been proposed. However, a visualization that is automatically generated might be different to how a user wants to arrange the objects in visualization space. By allowing users to re-locate objects in the embedding space of the visualization, they can adjust the visualization to their preference. We propose an active learning framework for interactive visualization which selects objects for the user to re-locate so that they can obtain their desired visualization by re-locating as few as possible. The framework is based on an information theoretic criterion, which favors objects that reduce the uncertainty of the visualization. We present a concrete application of the proposed framework to the Laplacian eigenmap visualization method. We demonstrate experimentally that the proposed framework yields the desired visualization with fewer user interactions than existing methods.

Konstantin Kravtsov, Stanislav Straupe, Igor Radchenko, Neil Houlsby, Ferenc Huszár, and Sergey Kulik. Experimental adaptive Bayesian tomography. Physical Review A, 87(6):062122, 2013.

Abstract: We report an experimental realization of an adaptive quantum state tomography protocol. Our method takes advantage of a Bayesian approach to statistical inference and is naturally tailored for adaptive strategies. For pure states we observe close to $N^-1$ scaling of infidelity with overall number of registered events, while best non-adaptive protocols allow for $N^-1/2$ scaling only. Experiments are performed for polarization qubits, but the approach is readily adapted to any dimension.

Ferenc Huszár and Neil Houlsby. Adaptive Bayesian quantum tomography. Physical Review A, 85(5):052120, 2012.

Abstract: In this paper we revisit the problem of optimal design of quantum tomographic experiments. In contrast to previous approaches where an optimal set of measurements is decided in advance of the experiment, we allow for measurements to be adaptively and efficiently re-optimised depending on data collected so far. We develop an adaptive statistical framework based on Bayesian inference and Shannon's information, and demonstrate a ten-fold reduction in the total number of measurements required as compared to non-adaptive methods, including mutually unbiased bases.

Neil Houlsby, Jose Miguel Hernández-Lobato, Ferenc Huszár, and Zoubin Ghahramani. Collaborative Gaussian processes for preference learning. In Advances in Neural Information Processing Systems 26, pages 2096-2104. Curran Associates, Inc., 2012.

Abstract: We present a new model based on Gaussian processes (GPs) for learning pairwise preferences expressed by multiple users. Inference is simplified by using a preference kernel for GPs which allows us to combine supervised GP learning of user preferences with unsupervised dimensionality reduction for multi-user systems. The model not only exploits collaborative information from the shared structure in user behavior, but may also incorporate user features if they are available. Approximate inference is implemented using a combination of expectation propagation and variational Bayes. Finally, we present an efficient active learning strategy for querying preferences. The proposed technique performs favorably on real-world data against state-of-the-art multi-user preference learning algorithms.

Neil Houlsby, Ferenc Huszar, Zoubin Ghahramani, and Máté Lengyel. Bayesian active learning for classification and preference learning. arXiv, abs/1112.5745, 2011.

Abstract: Information theoretic active learning has been widely studied for probabilistic models. For simple regression an optimal myopic policy is easily tractable. However, for other tasks and with more complex models, such as classification with nonparametric models, the optimal solution is harder to compute. Current approaches make approximations to achieve tractability. We propose an approach that expresses information gain in terms of predictive entropies, and apply this method to the Gaussian Process Classifier (GPC). Our approach makes minimal approximations to the full information theoretic objective. Our experimental performance compares favourably to many popular active learning algorithms, and has equal or lower computational complexity. We compare well to decision theoretic approaches also, which are privy to more information and require much more computational time. Secondly, by developing further a reformulation of binary preference learning to a classification problem, we extend our algorithm to Gaussian Process preference learning.


Ferenc Huszár

Neil Houlsby, Ferenc Huszár, Mohammad M Ghassemi, Gergő Orbán, Daniel M Wolpert, and Máté Lengyel. Cognitive tomography reveals complex task-independent mental representations. Current Biology, 23(21):2169-2175, 2013, doi 10.1016/j.cub.2013.09.012.

Abstract: Humans develop rich mental representations that guide their behavior in a variety of every-day tasks. However, it is unknown whether these representations, often formalized as priors in Bayesian inference, are specific for each task or subserve multiple tasks. Current approaches cannot distinguish between these two possibilities because they cannot extract comparable representations across different tasks. Here, we develop a novel method, termed cognitive tomography, that can extract complex, multi-dimensional priors across tasks. We apply this method to human judgments in two qualitatively different tasks, familiarity and odd-one-out, involving an ecologically relevant set of stimuli, human faces. We show that priors over faces are structurally complex and vary dramatically across subjects, but are invariant across the tasks within each subject. The priors we extract from each task allow us to predict with high precision the behavior of subjects for novel stimuli both in the same task as well as in the other task. Our results provide the first evidence for a single high-dimensional structured representation of a naturalistic stimulus set that guides behavior in multiple tasks. Moreover, the representations estimated by cognitive tomography can provide independent, behavior-based regressors for elucidating the neural correlates of complex naturalistic priors.

Konstantin Kravtsov, Stanislav Straupe, Igor Radchenko, Neil Houlsby, Ferenc Huszár, and Sergey Kulik. Experimental adaptive Bayesian tomography. Physical Review A, 87(6):062122, 2013.

Abstract: We report an experimental realization of an adaptive quantum state tomography protocol. Our method takes advantage of a Bayesian approach to statistical inference and is naturally tailored for adaptive strategies. For pure states we observe close to $N^-1$ scaling of infidelity with overall number of registered events, while best non-adaptive protocols allow for $N^-1/2$ scaling only. Experiments are performed for polarization qubits, but the approach is readily adapted to any dimension.

Ferenc Huszár and David Duvenaud. Optimally-weighted herding is Bayesian quadrature. In 28th Conference on Uncertainty in Artificial Intelligence, pages 377-385, Catalina Island, California, July 2012.

Abstract: Herding and kernel herding are deterministic methods of choosing samples which summarise a probability distribution. A related task is choosing samples for estimating integrals using Bayesian quadrature. We show that the criterion minimised when selecting samples in kernel herding is equivalent to the posterior variance in Bayesian quadrature. We then show that sequential Bayesian quadrature can be viewed as a weighted version of kernel herding which achieves performance superior to any other weighted herding method. We demonstrate empirically a rate of convergence faster than O(1/N). Our results also imply an upper bound on the empirical error of the Bayesian quadrature estimate.

Ferenc Huszár and Neil Houlsby. Adaptive Bayesian quantum tomography. Physical Review A, 85(5):052120, 2012.

Abstract: In this paper we revisit the problem of optimal design of quantum tomographic experiments. In contrast to previous approaches where an optimal set of measurements is decided in advance of the experiment, we allow for measurements to be adaptively and efficiently re-optimised depending on data collected so far. We develop an adaptive statistical framework based on Bayesian inference and Shannon's information, and demonstrate a ten-fold reduction in the total number of measurements required as compared to non-adaptive methods, including mutually unbiased bases.

Neil Houlsby, Jose Miguel Hernández-Lobato, Ferenc Huszár, and Zoubin Ghahramani. Collaborative Gaussian processes for preference learning. In Advances in Neural Information Processing Systems 26, pages 2096-2104. Curran Associates, Inc., 2012.

Abstract: We present a new model based on Gaussian processes (GPs) for learning pairwise preferences expressed by multiple users. Inference is simplified by using a preference kernel for GPs which allows us to combine supervised GP learning of user preferences with unsupervised dimensionality reduction for multi-user systems. The model not only exploits collaborative information from the shared structure in user behavior, but may also incorporate user features if they are available. Approximate inference is implemented using a combination of expectation propagation and variational Bayes. Finally, we present an efficient active learning strategy for querying preferences. The proposed technique performs favorably on real-world data against state-of-the-art multi-user preference learning algorithms.

Simon Lacoste-Julien, Ferenc Huszár, and Zoubin Ghahramani. Approximate inference for the loss-calibrated Bayesian. In Geoff Gordon and David Dunson, editors, 14th International Conference on Artificial Intelligence and Statistics, volume 15, pages 416-424, Fort Lauderdale, FL, USA, April 2011. Journal of Machine Learning Research.

Abstract: We consider the problem of approximate inference in the context of Bayesian decision theory. Traditional approaches focus on approximating general properties of the posterior, ignoring the decision task - and associated losses - for which the posterior could be used. We argue that this can be suboptimal and propose instead to loss-calibrate the approximate inference methods with respect to the decision task at hand. We present a general framework rooted in Bayesian decision theory to analyze approximate inference from the perspective of losses, opening up several research directions. As a first loss-calibrated approximate inference attempt, we propose an EM-like algorithm on the Bayesian posterior risk and show how it can improve a standard approach to Gaussian process classification when losses are asymmetric.

Neil Houlsby, Ferenc Huszar, Zoubin Ghahramani, and Máté Lengyel. Bayesian active learning for classification and preference learning. arXiv, abs/1112.5745, 2011.

Abstract: Information theoretic active learning has been widely studied for probabilistic models. For simple regression an optimal myopic policy is easily tractable. However, for other tasks and with more complex models, such as classification with nonparametric models, the optimal solution is harder to compute. Current approaches make approximations to achieve tractability. We propose an approach that expresses information gain in terms of predictive entropies, and apply this method to the Gaussian Process Classifier (GPC). Our approach makes minimal approximations to the full information theoretic objective. Our experimental performance compares favourably to many popular active learning algorithms, and has equal or lower computational complexity. We compare well to decision theoretic approaches also, which are privy to more information and require much more computational time. Secondly, by developing further a reformulation of binary preference learning to a classification problem, we extend our algorithm to Gaussian Process preference learning.

Ferenc Huszár and Simon Lacoste-Julien. A kernel approach to tractable Bayesian nonparametrics. Technical report, University of Cambridge, 2011.

Abstract: Inference in popular nonparametric Bayesian models typically relies on sampling or other approximations. This paper presents a general methodology for constructing novel tractable nonparametric Bayesian methods by applying the kernel trick to inference in a parametric Bayesian model. For example, Gaussian process regression can be derived this way from Bayesian linear regression. Despite the success of the Gaussian process framework, the kernel trick is rarely explicitly considered in the Bayesian literature. In this paper, we aim to fill this gap and demonstrate the potential of applying the kernel trick to tractable Bayesian parametric models in a wider context than just regression. As an example, we present an intuitive Bayesian kernel machine for density estimation that is obtained by applying the kernel trick to a Gaussian generative model in feature space.

Comment: arXiv:1103.1761

Ferenc Huszár, Uta Noppeney, and Máté Lengyel. Mind reading by machine learning: A doubly Bayesian method for inferring mental representations. In S. Ohlsson and R. Catrambone, editors, The Proceedings of the 32nd Annual Meeting of the Cognitive Science Society, Austin, TX, USA, August 2010. The Cognitive Science Society.

Abstract: A central challenge in cognitive science is to measure and quantify the mental representations humans develop - in other words, to 'read' subject's minds. In order to eliminate potential biases in reporting mental contents due to verbal elaboration, subjects' responses in experiments are often limited to binary decisions or discrete choices that do not require conscious reflection upon their mental contents. However, it is unclear what such impoverished data can tell us about the potential richness and dynamics of subjects' mental representations. To address this problem, we used ideal observer models that formalise choice behaviour as (quasi-)Bayes-optimal, given subjects' representations in long-term memory, acquired through prior learning, and the stimuli currently available to them. Bayesian inversion of such ideal observer models allowed us to infer subjects' mental representation from their choice behaviour in a variety of psychophysical tasks. The inferred mental representations also allowed us to predict future choices of subjects with reasonable accuracy, even in tasks that were different from those in which the representations were estimated. These results demonstrate a significant potential in standard binary decision tasks to recover detailed information about subjects' mental representations

Comment: Supplementary material available here.


David Knowles

Creighton Heaukulani, David A. Knowles, and Zoubin Ghahramani. Beta diffusion trees and hierarchical feature allocations. Technical report, Dept. of Engineering, University of Cambridge, August 2014.

Abstract: We define the beta diffusion tree, a random tree structure with a set of leaves that defines a collection of overlapping subsets of objects, known as a feature allocation. A generative process for the tree structure is defined in terms of particles (representing the objects) diffusing in some continuous space, analogously to the Dirichlet diffusion tree (Neal, 2003b), which defines a tree structure over partitions (i.e., non-overlapping subsets) of the objects. Unlike in the Dirichlet diffusion tree, multiple copies of a particle may exist and diffuse along multiple branches in the beta diffusion tree, and an object may therefore belong to multiple subsets of particles. We demonstrate how to build a hierarchically-clustered factor analysis model with the beta diffusion tree and how to perform inference over the random tree structures with a Markov chain Monte Carlo algorithm. We conclude with several numerical experiments on missing data problems with data sets of gene expression microarrays, international development statistics, and intranational socioeconomic measurements.

Creighton Heaukulani, David A. Knowles, and Zoubin Ghahramani. Beta diffusion trees. In 31st International Conference on Machine Learning, Beijing, China, June 2014.

Abstract: We define the beta diffusion tree, a random tree structure with a set of leaves that defines a collection of overlapping subsets of objects, known as a feature allocation. The generative process for the tree is defined in terms of particles (representing the objects) diffusing in some continuous space, analogously to the Dirichlet and Pitman–Yor diffusion trees (Neal, 2003b; Knowles & Ghahramani, 2011), both of which define tree structures over clusters of the particles. With the beta diffusion tree, however, multiple copies of a particle may exist and diffuse to multiple locations in the continuous space, resulting in (a random number of) possibly overlapping clusters of the objects. We demonstrate how to build a hierarchically-clustered factor analysis model with the beta diffusion tree and how to perform inference over the random tree structures with a Markov chain Monte Carlo algorithm. We conclude with several numerical experiments on missing data problems with data sets of gene expression arrays, international development statistics, and intranational socioeconomic measurements.

Konstantina Palla, David A. Knowles, and Zoubin Ghahramani. A reversible infinite hmm using normalised random measures. In 31st International Conference on Machine Learning, Beijing, China, June 2014.

Abstract: We present a nonparametric prior over reversible Markov chains. We use completely random measures, specifically gamma processes, to construct a countably infinite graph with weighted edges. By enforcing symmetry to make the edges undirected we define a prior over random walks on graphs that results in a reversible Markov chain. The resulting prior over infinite transition matrices is closely related to the hierarchical Dirichlet process but enforces reversibility. A reinforcement scheme has recently been proposed with similar properties, but the de Finetti measure is not well characterised. We take the alternative approach of explicitly constructing the mixing measure, which allows more straightforward and efficient inference at the cost of no longer having a closed form predictive distribution. We use our process to construct a reversible infinite HMM which we apply to two real datasets, one from epigenomics and one ion channel recording.

Novi Quadrianto, Viktoriia Sharmanska, David A. Knowles, and Zoubin Ghahramani. The supervised ibp: Neighbourhood preserving infinite latent feature models. In 29th Conference on Uncertainty in Artificial Intelligence, Bellevue, USA, July 2013.

Abstract: We propose a probabilistic model to infer supervised latent variables in the Hamming space from observed data. Our model allows simultaneous inference of the number of binary latent variables, and their values. The latent variables preserve neighbourhood structure of the data in a sense that objects in the same semantic concept have similar latent values, and objects in different concepts have dissimilar latent values. We formulate the supervised infinite latent variable problem based on an intuitive principle of pulling objects together if they are of the same type, and pushing them apart if they are not. We then combine this principle with a flexible Indian Buffet Process prior on the latent variables. We show that the inferred supervised latent variables can be directly used to perform a nearest neighbour search for the purpose of retrieval. We introduce a new application of dynamically extending hash codes, and show how to effectively couple the structure of the hash codes with continuously growing structure of the neighbourhood preserving infinite latent feature space.

Konstantina Palla, David A. Knowles, and Zoubin Ghahramani. A nonparametric variable clustering model. In Advances in Neural Information Processing Systems 26, pages 1-9, Lake Tahoe, California, USA, December 2012.

Abstract: Factor analysis models effectively summarise the covariance structure of high dimensional data, but the solutions are typically hard to interpret. This motivates attempting to find a disjoint partition, i.e. a simple clustering, of observed variables into highly correlated subsets. We introduce a Bayesian non-parametric approach to this problem, and demonstrate advantages over heuristic methods proposed to date. Our Dirichlet process variable clustering (DPVC) model can discover block-diagonal covariance structures in data. We evaluate our method on both synthetic and gene expression analysis problems.

Konstantina Palla, David A. Knowles, and Zoubin Ghahramani. An infinite latent attribute model for network data. In 29th International Conference on Machine Learning, Edinburgh, Scotland, June 2012.

Abstract: Latent variable models for network data extract a summary of the relational structure underlying an observed network. The simplest possible models subdivide nodes of the network into clusters; the probability of a link between any two nodes then depends only on their cluster assignment. Currently available models can be classified by whether clusters are disjoint or are allowed to overlap. These models can explain a "flat" clustering structure. Hierarchical Bayesian models provide a natural approach to capture more complex dependencies. We propose a model in which objects are characterised by a latent feature vector. Each feature is itself partitioned into disjoint groups (subclusters), corresponding to a second layer of hierarchy. In experimental comparisons, the model achieves significantly improved predictive performance on social and biological link prediction tasks. The results indicate that models with a single layer hierarchy over-simplify real networks.

Andrew Gordon Wilson, David A. Knowles, and Zoubin Ghahramani. Gaussian process regression networks. In 29th International Conference on Machine Learning, Edinburgh, Scotland, June 2012.

Abstract: We introduce a new regression framework, Gaussian process regression networks (GPRN), which combines the structural properties of Bayesian neural networks with the nonparametric flexibility of Gaussian processes. GPRN accommodates input (predictor) dependent signal and noise correlations between multiple output (response) variables, input dependent length-scales and amplitudes, and heavy-tailed predictive distributions. We derive both elliptical slice sampling and variational Bayes inference procedures for GPRN. We apply GPRN as a multiple output regression and multivariate volatility model, demonstrating substantially improved performance over eight popular multiple output (multi-task) Gaussian process models and three multivariate volatility models on real datasets, including a 1000 dimensional gene expression dataset.

Andrew Gordon Wilson, David A Knowles, and Zoubin Ghahramani. Gaussian process regression networks. Technical Report arXiv:1110.4411 [stat.ML], Department of Engineering, University of Cambridge, Cambridge, UK, October 19 2011.

Abstract: We introduce a new regression framework, Gaussian process regression networks (GPRN), which combines the structural properties of Bayesian neural networks with the non-parametric flexibility of Gaussian processes. This model accommodates input dependent signal and noise correlations between multiple response variables, input dependent length-scales and amplitudes, and heavy-tailed predictive distributions. We derive both efficient Markov chain Monte Carlo and variational Bayes inference procedures for this model. We apply GPRN as a multiple output regression and multivariate volatility model, demonstrating substantially improved performance over eight popular multiple output (multi-task) Gaussian process models and three multivariate volatility models on benchmark datasets, including a 1000 dimensional gene expression dataset.

Comment: arXiv:1110.4411

David A. Knowles and Zoubin Ghahramani. Nonparametric Bayesian sparse factor models with application to gene expression modelling.. Annals of Applied Statistics, 5(2B):1534-1552, 2011.

Abstract: A nonparametric Bayesian extension of Factor Analysis (FA) is proposed where observed data Y is modeled as a linear superposition, G, of a potentially infinite number of hidden factors, X. The Indian Buffet Process (IBP) is used as a prior on G to incorporate sparsity and to allow the number of latent features to be inferred. The model's utility for modeling gene expression data is investigated using randomly generated data sets based on a known sparse connectivity matrix for E. Coli, and on three biological data sets of increasing complexity.

David A. Knowles and Zoubin Ghahramani. Pitman-Yor diffusion trees. In 27th Conference on Uncertainty in Artificial Intelligence, 2011.

Abstract: We introduce the Pitman Yor Diffusion Tree (PYDT) for hierarchical clustering, a generalization of the Dirichlet Diffusion Tree (Neal, 2001) which removes the restriction to binary branching structure. The generative process is described and shown to result in an exchangeable distribution over data points. We prove some theoretical properties of the model and then present two inference methods: a collapsed MCMC sampler which allows us to model uncertainty over tree structures, and a computationally efficient greedy Bayesian EM search algorithm. Both algorithms use message passing on the tree structure. The utility of the model and algorithms is demonstrated on synthetic and real world data, both continuous and binary.

Comment: web site

David A. Knowles, Jurgen Van Gael, and Zoubin Ghahramani. Message passing algorithms for the Dirichlet diffusion tree. In 28th International Conference on Machine Learning, 2011.

Abstract: We demonstrate efficient approximate inference for the Dirichlet Diffusion Tree (Neal, 2003), a Bayesian nonparametric prior over tree structures. Although DDTs provide a powerful and elegant approach for modeling hierarchies they haven't seen much use to date. One problem is the computational cost of MCMC inference. We provide the first deterministic approximate inference methods for DDT models and show excellent performance compared to the MCMC alternative. We present message passing algorithms to approximate the Bayesian model evidence for a specific tree. This is used to drive sequential tree building and greedy search to find optimal tree structures, corresponding to hierarchical clusterings of the data. We demonstrate appropriate observation models for continuous and binary data. The empirical performance of our method is very close to the computationally expensive MCMC alternative on a density estimation problem, and significantly outperforms kernel density estimators.

Comment: web site

David A. Knowles and Thomas P. Minka. Non-conjugate variational message passing for multinomial and binary regression. In Advances in Neural Information Processing Systems 25, 2011.

Abstract: Variational Message Passing (VMP) is an algorithmic implementation of the Variational Bayes (VB) method which applies only in the special case of conjugate exponential family models. We propose an extension to VMP, which we refer to as Non-conjugate Variational Message Passing (NCVMP) which aims to alleviate this restriction while maintaining modularity, allowing choice in how expectations are calculated, and integrating into an existing message-passing framework: Infer.NET. We demonstrate NCVMP on logistic binary and multinomial regression. In the multinomial case we introduce a novel variational bound for the softmax factor which is tighter than other commonly used bounds whilst maintaining computational tractability.

Comment: web site supplementary

David A. Knowles, Leopold Parts, Daniel Glass, and John M. Winn. Inferring a measure of physiological age from multiple ageing related phenotypes. In NIPS Workshop: From Statistical Genetics to Predictive Models in Personalized Medicine, 2011.

Abstract: What is ageing? One definition is simultaneous degradation of multiple organ systems. Can an individual be said to be "old" or "young" for their (chronological) age in a scientifically meaningful way? We investigate these questions using ageing related phenotypes measured on the 12,000 female twins in the Twins UK study. We propose a simple linear model of ageing, which allows a latent adjustment to be made to an individual's chronological age to give her "physiological age", shared across the observed phenotypes. We note problems with the analysis resulting from the linearity assumption and show how to alleviate these issues using a non-linear extension. We find more gene expression probes are significantly associated with our measurement of physiological age than to chronological age.

Comment: web site

Mehregan Movassagh, Mun-Kit Choy, David A. Knowles, Lina Cordeddu, Syed Haider, Thomas Down, Lee Siggens, Ana Vujic, Ilenia Simeoni, Chris Penkett, Martin Goddard, Pietro Lio, Martin Bennett, and Roger Foo. Distinct epigenomic features in human cardiomyopathy. Circulation, American Heart Association, 2011.

Abstract: Background. The epigenome refers to marks on the genome including DNA methylation and histone modifications that regulate the expression of underlying genes. A consistent profile of gene expression changes in end- stage cardiomyopathy led us to hypothesise that distinct global patterns of the epigenome may also exist. Methods and Results. We constructed genome-wide maps of DNA methylation and Histone-3 Lysine-36 tri-methylation (H3K36me3)-enrichment for cardiomyopathic and normal human hearts. 506Mb of sequence per library was generated by high-throughput sequencing, covering 24 million out of the 28 million CG di-nucleotides in the human genome. DNA methylation was significantly different in promoter CpG-islands (CGI), intra-genic CGI, gene bodies and H3K36me3-enriched regions of the genome. Moreover DNA methylation differences were present in promoters of upregulated genes but not down-regulated genes. The profile of H3K36me3-enrichment itself was also significantly different in protein-coding regions of the genome. Conclusions. Distinct epigenomic patterns exist in important DNA elements of the human cardiac genome in end-stage cardiomyopathy. If epigenomic patterns track with disease progression, assays for the epigenome may be more useful than quantification of mRNA for assessing prognosis in heart failure. These results open up an important new horizon of research and further studies will be needed to determine how epigenomics contribute to altered gene expression in cardiomyopathy.

Cornelia Schone, Anne Venner, David A. Knowles, Mahesh M Karnani, and Denis Burdakov. Dichotomous cellular properties of mouse orexin/hypocretin neurons. The Journal of Physiology, 2011.

Abstract: Hypothalamic hypocretin/orexin (hcrt/orx) neurons recently emerged as critical regulators of sleep-wake cycles, reward-seeking, and body energy balance. However, at the level of cellular and network properties, it remains unclear whether hcrt/orx neurons are one homogenous population, or whether there are several distinct types of hcrt/orx cells. Here, we collated diverse structural and functional information about individual hcrt/orx neurons in mouse brain slices, by combining patch-clamp analysis of spike firing, membrane currents, and synaptic inputs with confocal imaging of cell shape and subsequent 3-dimensional Sholl analysis of dendritic architecture. Statistical cluster analysis of intrinsic firing properties revealed that hcrt/orx neurons fall into two distinct types. These two cell types also differ in the complexity of their dendritic arbour, the strength of AMPA and GABAA receptor-mediated synaptic drive that they receive, and the density of low-threshold, 4-aminopyridine-sensitive, transient K+ current. Our results provide quantitative evidence that, at the cellular level, the mouse hcrt/orx system is composed of two classes of neurons with different firing properties, morphologies, and synaptic input organization.

Daniel Glass, Leopold Parts, David A. Knowles, Abraham Aviv, , and Tim D. Spector. No correlation between childhood maltreatment and telomere length.. Biological Psychiatry, 68(6):21-22, 2010.

Abstract: Telomeres are lengths of repetitive DNA that cap the ends of chromosomes. They protect the ends of the chromosome and shorten with each cell division. Short leukocyte telomere length has been related to a number of age-related diseases. In addition, shorter telomere length has been associated with environmental factors such as smoking and lack of exercise. In a recent issue of Biological Psychiatry, Tyrka et al. (4) published a report suggesting a link between maltreatment in childhood and telomere shortening in 31 subjects. Individuals who had suffered maltreatment had telomere length .70 +/- .24 compared with 1.02 +/- .52 in individuals who had not been abused.

David A. Knowles, Leopold Parts, Daniel Glass, and John M. Winn. Modeling skin and ageing phenotypes using latent variable models in infer.net. In NIPS Workshop: Predictive Models in Personalized Medicine Workshop, 2010.

Abstract: We demonstrate and compare three unsupervised Bayesian latent variable models implemented in Infer.NET for biomedical data modeling of 42 skin and ageing phenotypes measured on the 12,000 female twins in the Twins UK study. We address various data modeling problems include high missingness, heterogeneous data, and repeat observations. We compare the proposed models in terms of their performance at predicting disease labels and symptoms from available explanatory variables, concluding that factor analysis type models have the strongest statistical performance in this setting. We show that such models can be combined with regression components for improved interpretability.

Comment: web site

Finale Doshi-Velez, David Knowles, Shakir Mohamed, and Zoubin Ghahramani. Large scale non-parametric inference: Data parallelisation in the Indian buffet process. In Advances in Neural Information Processing Systems 23, pages 1294-1302, Cambridge, MA, USA, December 2009. The MIT Press.

Abstract: Nonparametric Bayesian models provide a framework for flexible probabilistic modelling of complex datasets. Unfortunately, the high-dimensional averages required for Bayesian methods can be slow, especially with the unbounded representations used by nonparametric models. We address the challenge of scaling Bayesian inference to the increasingly large datasets found in real-world applications. We focus on parallelisation of inference in the Indian Buffet Process (IBP), which allows data points to have an unbounded number of sparse latent features. Our novel MCMC sampler divides a large data set between multiple processors and uses message passing to compute the global likelihoods and posteriors. This algorithm, the first parallel inference scheme for IBP-based models, scales to datasets orders of magnitude larger than have previously been possible.

David A. Knowles and Susan Holmes. Statistical tools for ultra-deep pyrosequencing of fast evolving viruses. In NIPS Workshop: Computational Biology, 2009.

Abstract: We aim to detect minor variant Hepatitis B viruses (HBV) in 38 pyrosequencing samples from infected individuals. Errors involved in the amplification and ultra deep pyrosequencing (UDPS) of these samples are characterised using HBV plasmid controls. Homopolymeric regions and quality scores are found to be significant covariates in determining insertion and deletion (indel) error rates, but not mismatch rates which depend on the nucleotide transition matrix. This knowledge is used to derive two methods for classifying genuine mutations: a hypothesis testing framework and a mixture model. Using an approximate "ground truth" from a limiting dilution Sanger sequencing run, these methods are shown to outperform the naive percentage threshold approach. The possibility of early stage PCR errors becoming significant is investigated by simulation, which underlines the importance of the initial copy number.

Comment: web site

David Knowles and Zoubin Ghahramani. Infinite sparse factor analysis and infinite independent components analysis. In 7th International Conference on Independent Component Analysis and Signal Separation, pages 381-388, London, UK, September 2007. Springer, doi 10.1007/978-3-540-74494-8_48.

Abstract: A nonparametric Bayesian extension of Independent Components Analysis (ICA) is proposed where observed data Y is modelled as a linear superposition, G, of a potentially infinite number of hidden sources, X. Whether a given source is active for a specific data point is specified by an infinite binary matrix, Z. The resulting sparse representation allows increased data reduction compared to standard ICA. We define a prior on Z using the Indian Buffet Process (IBP). We describe four variants of the model, with Gaussian or Laplacian priors on X and the one or two-parameter IBPs. We demonstrate Bayesian inference under these models using a Markov chain Monte Carlo (MCMC) algorithm on synthetic and gene expression data and compare to standard ICA algorithms.


Simon Lacoste-Julien

Simon Lacoste-Julien, Konstantina Palla, Alex Davies, Gjergji Kasneci, Thore Graepel, and Zoubin Ghahramani. Sigma: simple greedy matching for aligning large knowledge bases. In KDD, pages 572-580. ACM, 2013.

Abstract: The Internet has enabled the creation of a growing number of large-scale knowledge bases in a variety of domains containing complementary information. Tools for automatically aligning these knowledge bases would make it possible to unify many sources of structured knowledge and answer complex queries. However, the efficient alignment of large-scale knowledge bases still poses a considerable challenge. Here, we present Simple Greedy Matching (SiGMa), a simple algorithm for aligning knowledge bases with millions of entities and facts. SiGMa is an iterative propagation algorithm which leverages both the structural information from the relationship graph as well as flexible similarity measures between entity properties in a greedy local search, thus making it scalable. Despite its greedy nature, our experiments indicate that SiGMa can efficiently match some of the world's largest knowledge bases with high precision. We provide additional experiments on benchmark datasets which demonstrate that SiGMa can outperform state-of-the-art approaches both in accuracy and efficiency.

Simon Lacoste-Julien, Ferenc Huszár, and Zoubin Ghahramani. Approximate inference for the loss-calibrated Bayesian. In Geoff Gordon and David Dunson, editors, 14th International Conference on Artificial Intelligence and Statistics, volume 15, pages 416-424, Fort Lauderdale, FL, USA, April 2011. Journal of Machine Learning Research.

Abstract: We consider the problem of approximate inference in the context of Bayesian decision theory. Traditional approaches focus on approximating general properties of the posterior, ignoring the decision task - and associated losses - for which the posterior could be used. We argue that this can be suboptimal and propose instead to loss-calibrate the approximate inference methods with respect to the decision task at hand. We present a general framework rooted in Bayesian decision theory to analyze approximate inference from the perspective of losses, opening up several research directions. As a first loss-calibrated approximate inference attempt, we propose an EM-like algorithm on the Bayesian posterior risk and show how it can improve a standard approach to Gaussian process classification when losses are asymmetric.

Ferenc Huszár and Simon Lacoste-Julien. A kernel approach to tractable Bayesian nonparametrics. Technical report, University of Cambridge, 2011.

Abstract: Inference in popular nonparametric Bayesian models typically relies on sampling or other approximations. This paper presents a general methodology for constructing novel tractable nonparametric Bayesian methods by applying the kernel trick to inference in a parametric Bayesian model. For example, Gaussian process regression can be derived this way from Bayesian linear regression. Despite the success of the Gaussian process framework, the kernel trick is rarely explicitly considered in the Bayesian literature. In this paper, we aim to fill this gap and demonstrate the potential of applying the kernel trick to tractable Bayesian parametric models in a wider context than just regression. As an example, we present an intuitive Bayesian kernel machine for density estimation that is obtained by applying the kernel trick to a Gaussian generative model in feature space.

Comment: arXiv:1103.1761


James Lloyd

James Robert Lloyd, David Duvenaud, Roger Grosse, Joshua B. Tenenbaum, and Zoubin Ghahramani. Automatic construction and natural-language description of nonparametric regression models. In Association for the Advancement of Artificial Intelligence (AAAI), July 2014.

Abstract: This paper presents the beginnings of an automatic statistician, focusing on regression problems. Our system explores an open-ended space of statistical models to discover a good explanation of a data set, and then produces a detailed report with figures and natural-language text. Our approach treats unknown regression functions nonparametrically using Gaussian processes, which has two important consequences. First, Gaussian processes can model functions in terms of high-level properties (e.g. smoothness, trends, periodicity, changepoints). Taken together with the compositional structure of our language of models this allows us to automatically describe functions in simple terms. Second, the use of flexible nonparametric models and a rich language for composing them in an open-ended manner also results in state-of-the-art extrapolation performance evaluated over 13 real time series data sets from various domains.

José Miguel Hernández-Lobato, James Robert Lloyds, and Daniel Hernández-Lobato. Gaussian process conditional copulas with applications to financial time series. In Advances in Neural Information Processing Systems 27, pages 1-9, Lake Tahoe, California, USA, December 2013.

Abstract: The estimation of dependencies between multiple variables is a central problem in the analysis of financial time series. A common approach is to express these dependencies in terms of a copula function. Typically the copula function is assumed to be constant but this may be inaccurate when there are covariates that could have a large influence on the dependence structure of the data. To account for this, a Bayesian framework for the estimation of conditional copulas is proposed. In this framework the parameters of a copula are non-linearly related to some arbitrary conditioning variables. We evaluate the ability of our method to predict time-varying dependencies on several equities and currencies and observe consistent performance gains compared to static copula models and other time-varying copula methods.

David Duvenaud, James Robert Lloyd, Roger Grosse, Joshua B. Tenenbaum, and Zoubin Ghahramani. Structure discovery in nonparametric regression through compositional kernel search. In 30th International Conference on Machine Learning, Atlanta, Georgia, USA, June 2013.

Abstract: Despite its importance, choosing the structural form of the kernel in nonparametric regression remains a black art. We define a space of kernel structures which are built compositionally by adding and multiplying a small number of base kernels. We present a method for searching over this space of structures which mirrors the scientific discovery process. The learned structures can often decompose functions into interpretable components and enable long-range extrapolation on time-series datasets. Our structure search method outperforms many widely used kernels and kernel combination methods on a variety of prediction tasks.

James Robert Lloyd. Gefcom2012 hierarchical load forecasting: Gradient boosting machines and gaussian processes. International Journal of Forecasting, 2013.

Abstract: This report discusses methods for forecasting hourly loads of a US utility as part of the load forecasting track of the Global Energy Forecasting Competition 2012 hosted on Kaggle. The methods described (gradient boosting machines and Gaussian processes) are generic machine learning / regression algorithms and few domain specific adjustments were made. Despite this, the algorithms were able to produce highly competitive predictions and hopefully they can inspire more refined techniques to compete with state-of-the-art load forecasting methodologies.

James Robert Lloyd, Peter Orbanz, Zoubin Ghahramani, and Daniel M. Roy. Random function priors for exchangeable arrays with applications to graphs and relational data. In Advances in Neural Information Processing Systems 26, pages 1-9, Lake Tahoe, California, USA, December 2012.

Abstract: A fundamental problem in the analysis of structured relational data like graphs, networks, databases, and matrices is to extract a summary of the common structure underlying relations between individual entities. Relational data are typically encoded in the form of arrays; invariance to the ordering of rows and columns corresponds to exchangeable arrays. Results in probability theory due to Aldous, Hoover and Kallenberg show that exchangeable arrays can be represented in terms of a random measurable function which constitutes the natural model parameter in a Bayesian model. We obtain a flexible yet simple Bayesian nonparametric model by placing a Gaussian process prior on the parameter function. Efficient inference utilises elliptical slice sampling combined with a random sparse approximation to the Gaussian process. We demonstrate applications of the model to network data and clarify its relation to models in the literature, several of which emerge as special cases.


David Lopez-Paz

David Lopez-Paz, Suvrit Sra, Alex J. Smola, Zoubin Ghahramani, and Bernhard Schölkopf. Randomized nonlinear component analysis. In ICML, volume 29 of JMLR Proceedings. JMLR.org, 2014.

Abstract: Classical techniques such as Principal Component Analysis (PCA) and Canonical Correlation Analysis (CCA) are ubiquitous in statistics. However, these techniques only reveal linear relationships in data. Although nonlinear variants of PCA and CCA have been proposed, they are computationally prohibitive in the large scale. In a separate strand of recent research, randomized methods have been proposed to construct features that help reveal nonlinear patterns in data. For basic tasks such as regression or classification, random features exhibit little or no loss in performance, while achieving dramatic savings in computational requirements. In this paper we leverage randomness to design scalable new variants of nonlinear PCA and CCA; our ideas also extend to key multivariate analysis tools such as spectral clustering or LDA. We demonstrate our algorithms through experiments on real-world data, on which we compare against the state-of-the-art. Code in R implementing our methods is provided in the Appendix.

David Lopez-Paz, Philipp Hennig, and Bernhard Scholköpf. The randomized dependence coefficient. In Advances in Neural Information Processing Systems 27, pages 1-9, Lake Tahoe, California, USA, December 2013.

Abstract: We introduce the Randomized Dependence Coefficient (RDC), a measure of non-linear dependence between random variables of arbitrary dimension based on the Hirschfeld-Gebelein-Rényi Maximum Correlation Coefficient. RDC is defined in terms of correlation of random non-linear copula projections; it is invariant with respect to marginal distribution transformations, has low computational cost and is easy to implement: just five lines of R code, included at the end of the paper.

David Lopez-Paz, José Miguel Hernández-Lobato, and Zoubin Ghahramani. Gaussian process vine copulas for multivariate dependence. In 30th International Conference on Machine Learning, Atlanta, Georgia, USA, June 2013.

Abstract: Copulas allow to learn marginal distributions separately from the multivariate dependence structure (copula) that links them together into a density function. Vine factorizations ease the learning of high-dimensional copulas by constructing a hierarchy of conditional bivariate copulas. However, to simplify inference, it is common to assume that each of these conditional bivariate copulas is independent from its conditioning variables. In this paper, we relax this assumption by discovering the latent functions that specify the shape of a conditional copula given its conditioning variables We learn these functions by following a Bayesian approach based on sparse Gaussian processes with expectation propagation for scalable, approximate inference. Experiments on real-world datasets show that, when modeling all conditional dependencies, we obtain better estimates of the underlying copula of the data.

David Lopez-Paz, José Miguel Hernández-Lobato, and Bernhard Scholköpf. Semi-supervised domain adaptation with non-parametric copulas. In Advances in Neural Information Processing Systems 26, pages 1-9, Lake Tahoe, California, USA, December 2012.

Abstract: A new framework based on the theory of copulas is proposed to address semisupervised domain adaptation problems. The presented method factorizes any multivariate density into a product of marginal distributions and bivariate copula functions. Therefore, changes in each of these factors can be detected and corrected to adapt a density model accross different learning domains. Importantly, we introduce a novel vine copula model, which allows for this factorization in a non-parametric manner. Experimental results on regression problems with real-world data illustrate the efficacy of the proposed approach when compared to state-of-the-art techniques.


Andrew McHutchon

Andrew McHutchon and Carl Edward Rasmussen. Gaussian process training with input noise. In Advances in Neural Information Processing Systems 24, Granada, Spain, December 2012.

Abstract: In standard Gaussian Process regression input locations are assumed to be noise free. We present a simple yet effective GP model for training on input points corrupted by i.i.d. Gaussian noise. To make computations tractable we use a local linear expansion about each input point. This allows the input noise to be recast as output noise proportional to the squared gradient of the GP posterior mean. The input noise hyperparameters are trained alongside other hyperparameters by the usual method of maximisation of the marginal likelihood, and allow estimation of the noise levels on each input dimension. Training uses an iterative scheme, which alternates between optimising the hyperparameters and calculating the posterior gradient. Analytic predictive moments can then be found for Gaussian distributed test points. We compare our model to others over a range of different regression problems and show that it improves over current methods.


Shakir Mohamed

Shakir Mohamed, Katherine A. Heller, and Zoubin Ghahramani. Bayesian and L1 approaches for sparse unsupervised learning. In 29th International Conference on Machine Learning, 2012.

Abstract: The use of L1 regularisation for sparse learning has generated immense research interest, with many successful applications in diverse areas such as signal acquisition, image coding, genomics and collaborative filtering. While existing work highlights the many advantages of L1 methods, in this paper we find that L1 regularisation often dramatically under-performs in terms of predictive performance when compared to other methods for inferring sparsity. We focus on unsupervised latent variable models, and develop L1 minimising factor models, Bayesian variants of "L1", and Bayesian models with a stronger L0-like sparsity induced through spike-and-slab distributions. These spike-and-slab Bayesian factor models encourage sparsity while accounting for uncertainty in a principled manner, and avoid unnecessary shrinkage of non-zero values. We demonstrate on a number of data sets that in practice spike-and-slab Bayesian methods out-perform L1 minimisation, even on a com- putational budget. We thus highlight the need to re-assess the wide use of L1 methods in sparsity-reliant applications, particularly when we care about generalising to previously unseen data, and provide an alternative that, over many varying conditions, provides improved generalisation performance.

Shakir Mohamed. Generalised Bayesian matrix factorisation models. PhD thesis, University of Cambridge, Department of Engineering, Cambridge, UK, 2011.

Abstract: Factor analysis and related models for probabilistic matrix factorisation are of central importance to the unsupervised analysis of data, with a colourful history more than a century long. Probabilistic models for matrix factorisation allow us to explore the underlying structure in data, and have relevance in a vast number of application areas including collaborative filtering, source separation, missing data imputation, gene expression analysis, information retrieval, computational finance and computer vision, amongst others.
This thesis develops generalisations of matrix factorisation models that advance our understanding and enhance the applicability of this important class of models. The generalisation of models for matrix factorisation focuses on three concerns: widening the applicability of latent variable models to the diverse types of data that are currently available; considering alternative structural forms in the underlying representations that are inferred; and including higher order data structures into the matrix factorisation framework. These three issues reflect the reality of modern data analysis and we develop new models that allow for a principled exploration and use of data in these settings. We place emphasis on Bayesian approaches to learning and the advantages that come with the Bayesian methodology. Our port of departure is a generalisation of latent variable models to members of the exponential family of distributions. This generalisation allows for the analysis of data that may be real-valued, binary, counts, non-negative or a heterogeneous set of these data types. The model unifies various existing models and constructs for unsupervised settings, the complementary framework to the generalised linear models in regression.
Moving to structural considerations, we develop Bayesian methods for learning sparse latent representations. We define ideas of weakly and strongly sparse vectors and investigate the classes of prior distributions that give rise to these forms of sparsity, namely the scale-mixture of Gaussians and the spike-and-slab distribution. Based on these sparsity favouring priors, we develop and compare methods for sparse matrix factorisation and present the first comparison of these sparse learning approaches. As a second structural consideration, we develop models with the ability to generate correlated binary vectors. Moment-matching is used to allow binary data with specified correlation to be generated, based on dichotomisation of the Gaussian distribution. We then develop a novel and simple method for binary PCA based on Gaussian dichotomisation. The third generalisation considers the extension of matrix factorisation models to multi-dimensional arrays of data that are increasingly prevalent. We develop the first Bayesian model for non-negative tensor factorisation and explore the relationship between this model and the previously described models for matrix factorisation.

Finale Doshi-Velez, David Knowles, Shakir Mohamed, and Zoubin Ghahramani. Large scale non-parametric inference: Data parallelisation in the Indian buffet process. In Advances in Neural Information Processing Systems 23, pages 1294-1302, Cambridge, MA, USA, December 2009. The MIT Press.

Abstract: Nonparametric Bayesian models provide a framework for flexible probabilistic modelling of complex datasets. Unfortunately, the high-dimensional averages required for Bayesian methods can be slow, especially with the unbounded representations used by nonparametric models. We address the challenge of scaling Bayesian inference to the increasingly large datasets found in real-world applications. We focus on parallelisation of inference in the Indian Buffet Process (IBP), which allows data points to have an unbounded number of sparse latent features. Our novel MCMC sampler divides a large data set between multiple processors and uses message passing to compute the global likelihoods and posteriors. This algorithm, the first parallel inference scheme for IBP-based models, scales to datasets orders of magnitude larger than have previously been possible.

Shakir Mohamed, Katherine A. Heller, and Zoubin Ghahramani. Bayesian exponential family PCA. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 1089-1096, Cambridge, MA, USA, December 2009. The MIT Press.

Abstract: Principal Components Analysis (PCA) has become established as one of the key tools for dimensionality reduction when dealing with real valued data. Approaches such as exponential family PCA and non-negative matrix factorisation have successfully extended PCA to non-Gaussian data types, but these techniques fail to take advantage of Bayesian inference and can suffer from problems of overfitting and poor generalisation. This paper presents a fully probabilistic approach to PCA, which is generalised to the exponential family, based on Hybrid Monte Carlo sampling. We describe the model which is based on a factorisation of the observed data matrix, and show performance of the model on both synthetic and real data.

Comment: spotlight.

Mikkel N. Schmidt and Shakir Mohamed. Probabilistic non-negative tensor factorization using Markov chain Monte Carlo. In European Signal Processing Conference (EUSIPCO), pages 1918-1922, Glasgow, Scotland, August 2009.

Abstract: We present a probabilistic model for learning non-negative tensor factorizations (NTF), in which the tensor factors are latent variables associated with each data dimension. The non-negativity constraint for the latent factors is handled by choosing priors with support on the non-negative numbers. Two Bayesian inference procedures based on Markov chain Monte Carlo sampling are described: Gibbs sampling and Hamiltonian Markov chain Monte Carlo. We evaluate the model on two food science data sets, and show that the probabilistic NTF model leads to better predictions and avoids overfitting compared to existing NTF approaches.

Comment: Rated by reviewers amongst the top 5% of the presented papers.


Peter Orbanz

James Robert Lloyd, Peter Orbanz, Zoubin Ghahramani, and Daniel M. Roy. Random function priors for exchangeable arrays with applications to graphs and relational data. In Advances in Neural Information Processing Systems 26, pages 1-9, Lake Tahoe, California, USA, December 2012.

Abstract: A fundamental problem in the analysis of structured relational data like graphs, networks, databases, and matrices is to extract a summary of the common structure underlying relations between individual entities. Relational data are typically encoded in the form of arrays; invariance to the ordering of rows and columns corresponds to exchangeable arrays. Results in probability theory due to Aldous, Hoover and Kallenberg show that exchangeable arrays can be represented in terms of a random measurable function which constitutes the natural model parameter in a Bayesian model. We obtain a flexible yet simple Bayesian nonparametric model by placing a Gaussian process prior on the parameter function. Efficient inference utilises elliptical slice sampling combined with a random sparse approximation to the Gaussian process. We demonstrate applications of the model to network data and clarify its relation to models in the literature, several of which emerge as special cases.

Peter Orbanz. Projective limit random probabilities on Polish spaces. Electron. J. Stat., 5:1354-1373, 2011.

Abstract: A pivotal problem in Bayesian nonparametrics is the construction of prior distributions on the space M(V) of probability measures on a given domain V. In principle, such distributions on the infinite-dimensional space M(V) can be constructed from their finite-dimensional marginals—the most prominent example being the construction of the Dirichlet process from finite-dimensional Dirichlet distributions. This approach is both intuitive and applicable to the construction of arbitrary distributions on M(V), but also hamstrung by a number of technical difficulties. We show how these difficulties can be resolved if the domain V is a Polish topological space, and give a representation theorem directly applicable to the construction of any probability distribution on M(V) whose first moment measure is well-defined. The proof draws on a projective limit theorem of Bochner, and on properties of set functions on Polish spaces to establish countable additivity of the resulting random probabilities.

Sinead Williamson, Peter Orbanz, and Zoubin Ghahramani. Dependent Indian buffet processes. In 13th International Conference on Artificial Intelligence and Statistics, volume 9 of W & CP, pages 924-931, Chia Laguna, Sardinia, Italy, May 2010.

Abstract: Latent variable models represent hidden structure in observational data. To account for the distribution of the observational data changing over time, space or some other covariate, we need generalizations of latent variable models that explicitly capture this dependency on the covariate. A variety of such generalizations has been proposed for latent variable models based on the Dirichlet process. We address dependency on covariates in binary latent feature models, by introducing a dependent Indian Buffet Process. The model generates a binary random matrix with an unbounded number of columns for each value of the covariate. Evolution of the binary matrices over the covariate set is controlled by a hierarchical Gaussian process model. The choice of covariance functions controls the dependence structure and exchangeability properties of the model. We derive a Markov Chain Monte Carlo sampling algorithm for Bayesian inference, and provide experiments on both synthetic and real-world data. The experimental results show that explicit modeling of dependencies significantly improves accuracy of predictions.

Peter Orbanz and Yee-Whye Teh. Bayesian nonparametric models. In Encyclopedia of Machine Learning. Springer, 2010.

Peter Orbanz. Construction of nonparametric Bayesian models from parametric Bayes equations. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1392-1400. The MIT Press, 2009.

Abstract: We consider the general problem of constructing nonparametric Bayesian models on infinite-dimensional random objects, such as functions, infinite graphs or infinite permutations. The problem has generated much interest in machine learning, where it is treated heuristically, but has not been studied in full generality in nonparametric Bayesian statistics, which tends to focus on models over probability distributions. Our approach applies a standard tool of stochastic process theory, the construction of stochastic processes from their finite-dimensional marginal distributions. The main contribution of the paper is a generalization of the classic Kolmogorov extension theorem to conditional probabilities. This extension allows a rigorous construction of nonparametric Bayesian models from systems of finitedimensional, parametric Bayes equations. Using this approach, we show (i) how existence of a conjugate posterior for the nonparametric model can be guaranteed by choosing conjugate finite-dimensional models in the construction, (ii) how the mapping to the posterior parameters of the nonparametric model can be explicitly determined, and (iii) that the construction of conjugate models in essence requires the finite-dimensional models to be in the exponential family. As an application of our constructive framework, we derive a model on infinite permutations, the nonparametric Bayesian analogue of a model recently proposed for the analysis of rank data.

Comment: Supplements (proofs) and techreport version


Pedro Ortega

Daniel A. Braun, Pedro A. Ortega, Evangelos Theodorou, and Stefan Schaal. Path integral control and bounded rationality. In 2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, 2011.

Abstract: Path integral methods have recently been shown to be applicable to a very general class of optimal control problems. Here we examine the path integral formalism from a decision-theoretic point of view, since an optimal controller can always be regarded as an instance of a perfectly rational decision-maker that chooses its actions so as to maximize its expected utility. The problem with perfect rationality is, however, that finding optimal actions is often very difficult due to prohibitive computational resource costs that are not taken into account. In contrast, a bounded rational decision-maker has only limited resources and therefore needs to strike some compromise between the desired utility and the required resource costs. In particular, we suggest an information-theoretic measure of resource costs that can be derived axiomatically. As a consequence we obtain a variational principle for choice probabilities that trades off maximizing a given utility criterion and avoiding resource costs that arise due to deviating from initially given default choice probabilities. The resulting bounded rational policies are in general probabilistic. We show that the solutions found by the path integral formalism are such bounded rational policies. Furthermore, we show that the same formalism generalizes to discrete control problems, leading to linearly solvable bounded rational control policies in the case of Markov systems. Importantly, Bellman's optimality principle is not presupposed by this variational principle, but it can be derived as a limit case. This suggests that the information theoretic formalization of bounded rationality might serve as a general principle in control design that unifies a number of recently reported approximate optimal control methods both in the continuous and discrete domain.

Daniel A. Braun, Pedro A. Ortega, and Daniel M. Wolpert. Motor coordination: When two have to act as one. Special issue of Experimental Brain Research on Joint Action, 2011.

Abstract: Trying to pass someone walking toward you in a narrow corridor is a familiar example of a two-person motor game that requires coordination. In this study, we investigate coordination in sensorimotor tasks that correspond to classic coordination games with multiple Nash equilibria, such as "choosing sides", "stag hunt", "chicken", and "battle of sexes". In these tasks, subjects made reaching movements reflecting their continuously evolving "decisions" while they received a continuous payoff in the form of a resistive force counteracting their movements. Successful coordination required two subjects to "choose" the same Nash equilibrium in this force-payoff landscape within a single reach. We found that on the majority of trials coordination was achieved. Compared to the proportion of trials in which miscoordination occurred, successful coordination was characterized by several distinct features: an increased mutual information between the players' movement endpoints, an increased joint entropy during the movements, and by differences in the timing of the players' responses. Moreover, we found that the probability of successful coordination depends on the players' initial distance from the Nash equilibria. Our results suggest that two-person coordination arises naturally in motor interactions and is facilitated by favorable initial positions, stereotypical motor pattern, and differences in response times.

Pedro A. Ortega and Daniel A. Braun. Information, utility and bounded rationality. In The fourth conference on artificial general intelligence, volume 6830 of Lecture Notes on Artificial Intelligence, pages 269-274. Springer-Verlag, 2011.

Abstract: Perfectly rational decision-makers maximize expected utility, but crucially ignore the resource costs incurred when determining optimal actions. Here we employ an axiomatic framework for bounded rational decision-making based on a thermodynamic interpretation of resource costs as information costs. This leads to a variational free utility principle akin to thermodynamical free energy that trades off utility and information costs. We show that bounded optimal control solutions can be derived from this variational principle, which leads in general to stochastic policies. Furthermore, we show that risk-sensitive and robust (minimax) control schemes fall out naturally from this framework if the environment is considered as a bounded rational and perfectly rational opponent, respectively. When resource costs are ignored, the maximum expected utility principle is recovered.

Pedro A. Ortega, Daniel A. Braun, and Simon Godsill. Reinforcement learning and the Bayesian control rule. In The fourth conference on artificial general intelligence, volume 6830 of Lecture Notes on Artificial Intelligence, pages 281-285. Springer-Verlag, 2011.

Abstract: We present an actor-critic scheme for reinforcement learning in complex domains. The main contribution is to show that planning and I/O dynamics can be separated such that an intractable planning problem reduces to a simple multi-armed bandit problem, where each lever stands for a potentially arbitrarily complex policy. Furthermore, we use the Bayesian control rule to construct an adaptive bandit player that is universal with respect to a given class of optimal bandit players, thus indirectly constructing an adaptive agent that is universal with respect to a given class of policies.

Pedro A. Ortega. A Unified Framework for Resource-Bounded Agents Interacting with an Unknown Environment. PhD thesis, Department of Engineering, University of Cambridge, 2011.

Abstract: The aim of this thesis is to present a mathematical framework for conceptualizing and constructing adaptive autonomous systems under resource constraints. The first part of this thesis contains a concise presentation of the foundations of classical agency: namely the formalizations of decision making and learning. Decision making includes: (a) subjective expected utility (SEU) theory, the framework of decision making under uncertainty; (b) the maximum SEU principle to choose the optimal solution; and (c) its application to the design of autonomous systems, culminating in the Bellman optimality equations. Learning includes: (a) Bayesian probability theory, the theory for reasoning under uncertainty that extends logic; and (b) Bayes-Optimal agents, the application of Bayesian probability theory to the design of optimal adaptive agents. Then, two major problems of the maximum SEU principle are highlighted: (a) the prohibitive computational costs and (b) the need for the causal precedence of the choice of the policy. The second part of this thesis tackles the two aforementioned problems. First, an information-theoretic notion of resources in autonomous systems is established. Second, a framework for resource-bounded agency is introduced. This includes: (a) a maximum bounded SEU principle that is derived from a set of axioms of utility; (b) an axiomatic model of probabilistic causality, which is applied for the formalization of autonomous systems having uncertainty over their policy and environment; and (c) the Bayesian control rule, which is derived from the maximum bounded SEU principle and the model of causality, implementing a stochastic adaptive control law that deals with the case where autonomous agents are uncertain about their policy and environment.

Daniel A. Braun and Pedro A. Ortega. A minimum relative entropy principle for adaptive control in linear quadratic regulators. In Proceedings of the 7th international conference on informatics in control, automation and robotics, page (in press), 2010.

Abstract: The design of optimal adaptive controllers is usually based on heuristics, because solving Bellman's equations over information states is notoriously intractable. Approximate adaptive controllers often rely on the principle of certainty-equivalence where the control process deals with parameter point estimates as if they represented ``true'' parameter values. Here we present a stochastic control rule instead where controls are sampled from a posterior distribution over a set of probabilistic input-output models and the true model is identified by Bayesian inference. This allows reformulating the adaptive control problem as an inference and sampling problem derived from a minimum relative entropy principle. Importantly, inference and action sampling both work forward in time and hence such a Bayesian adaptive controller is applicable on-line. We demonstrate the improved performance that can be achieved by such an approach for linear quadratic regulator examples.

Pedro A. Ortega and Daniel A. Braun. An axiomatic formalization of bounded rationality based on a utility-information equivalence. Technical report, Dept. of Engineering, University of Cambridge, 2010.

Abstract: Classic decision-theory is based on the maximum expected utility (MEU) principle, but crucially ignores the resource costs incurred when determining optimal decisions. Here we propose an axiomatic framework for bounded decision-making that considers resource costs. Agents are formalized as probability measures over input-output streams. We postulate that any such probability measure can be assigned a corresponding conjugate utility function based on three axioms: utilities should be real-valued, additive and monotonic mappings of probabilities. We show that these axioms enforce a unique conversion law between utility and probability (and thereby, information). Moreover, we show that this relation can be characterized as a variational principle: given a utility function, its conjugate probability measure maximizes a free utility functional. Transformations of probability measures can then be formalized as a change in free utility due to the addition of new constraints expressed by a target utility function. Accordingly, one obtains a criterion to choose a probability measure that trades off the maximization of a target utility function and the cost of the deviation from a reference distribution. We show that optimal control, adaptive estimation and adaptive control problems can be solved this way in a resource-efficient way. When resource costs are ignored, the MEU principle is recovered. Our formalization might thus provide a principled approach to bounded rationality that establishes a close link to information theory.

Pedro A. Ortega and Daniel A. Braun. A Bayesian rule for adaptive control based on causal interventions. In The third conference on artificial general intelligence, pages 115-120, Paris, 2010. Atlantis Press.

Abstract: Explaining adaptive behavior is a central problem in artificial intelligence research. Here we formalize adaptive agents as mixture distributions over sequences of inputs and outputs (I/O). Each distribution of the mixture constitutes a "possible world", but the agent does not know which of the possible worlds it is actually facing. The problem is to adapt the I/O stream in a way that is compatible with the true world. A natural measure of adaptation can be obtained by the Kullback Leibler (KL) divergence between the I/O distribution of the true world and the I/O distribution expected by the agent that is uncertain about possible worlds. In the case of pure input streams, the Bayesian mixture provides a well-known solution for this problem. We show, however, that in the case of I/O streams this solution breaks down, because outputs are issued by the agent itself and require a different probabilistic syntax as provided by intervention calculus. Based on this calculus, we obtain a Bayesian control rule that allows modeling adaptive behavior with mixture distributions over I/O streams. This rule might allow for a novel approach to adaptive control based on a minimum KL-principle.

Pedro A. Ortega and Daniel A. Braun. A conversion between utility and information. In The third conference on artificial general intelligence, pages 115-120, Paris, 2010. Atlantis Press.

Abstract: Rewards typically express desirabilities or preferences over a set of alternatives. Here we propose that rewards can be defined for any probability distribution based on three desiderata, namely that rewards should be real- valued, additive and order-preserving, where the later implies that more probable events should also be more desirable. Our main result states that rewards are then uniquely determined by the negative information content. To analyze stochastic processes, we define the utility of a realization as its reward rate. Under this interpretation, we show that the expected utility of a stochastic process is its negative entropy rate. Furthermore, we apply our results to analyze agent-environment interactions. We show that the expected utility that will actually be achieved by the agent is given by the negative cross-entropy from the input-output (I/O) distribution of the coupled interaction system and the agent's I/O distribution. Thus, our results allow for an information-theoretic interpretation of the notion of utility and the characterization of agent-environment interactions in terms of entropy dynamics.

Pedro A. Ortega and Daniel A. Braun. A minimum relative entropy principle for learning and acting. Journal of Artificial Intelligence Research, 38:475-511, 2010, doi 10.1613/jair.3062.

Abstract: This paper proposes a method to construct an adaptive agent that is univemacmacrsal with respect to a given class of experts, where each expert is designed specifically for a particular environment. This adaptive control problem is formalized as the problem of minimizing the relative entropy of the adaptive agent from the expert that is most suitable for the unknown environment. If the agent is a passive observer, then the optimal solution is the well-known Bayesian predictor. However, if the agent is active, then its past actions need to be treated as causal interventions on the I/O stream rather than normal probability conditions. Here it is shown that the solution to this new variational problem is given by a stochastic controller called the Bayesian control rule, which implements adaptive behavior as a mixture of experts. Furthermore, it is shown that under mild assumptions, the Bayesian control rule converges to the control law of the most suitable expert.

Daniel A. Braun, Pedro A. Ortega, and Daniel M. Wolpert. Nash equilibria in multi-agent motor interactions. PLoS Computational Biology, 5(8), 2009.

Abstract: Social interactions in classic cognitive games like the ultimatum game or the prisoner's dilemma typically lead to Nash equilibria when multiple competitive decision makers with perfect knowledge select optimal strategies. However, in evolutionary game theory it has been shown that Nash equilibria can also arise as attractors in dynamical systems that can describe, for example, the population dynamics of microorganisms. Similar to such evolutionary dynamics, we find that Nash equilibria arise naturally in motor interactions in which players vie for control and try to minimize effort. When confronted with sensorimotor interaction tasks that correspond to the classical prisoner's dilemma and the rope-pulling game, two-player motor interactions led predominantly to Nash solutions. In contrast, when a single player took both roles, playing the sensorimotor game bimanually, cooperative solutions were found. Our methodology opens up a new avenue for the study of human motor interactions within a game theoretic framework, suggesting that the coupling of motor systems can lead to game theoretic solutions.


Konstantina Palla

Konstantina Palla, David A. Knowles, and Zoubin Ghahramani. A reversible infinite hmm using normalised random measures. In 31st International Conference on Machine Learning, Beijing, China, June 2014.

Abstract: We present a nonparametric prior over reversible Markov chains. We use completely random measures, specifically gamma processes, to construct a countably infinite graph with weighted edges. By enforcing symmetry to make the edges undirected we define a prior over random walks on graphs that results in a reversible Markov chain. The resulting prior over infinite transition matrices is closely related to the hierarchical Dirichlet process but enforces reversibility. A reinforcement scheme has recently been proposed with similar properties, but the de Finetti measure is not well characterised. We take the alternative approach of explicitly constructing the mixing measure, which allows more straightforward and efficient inference at the cost of no longer having a closed form predictive distribution. We use our process to construct a reversible infinite HMM which we apply to two real datasets, one from epigenomics and one ion channel recording.

Simon Lacoste-Julien, Konstantina Palla, Alex Davies, Gjergji Kasneci, Thore Graepel, and Zoubin Ghahramani. Sigma: simple greedy matching for aligning large knowledge bases. In KDD, pages 572-580. ACM, 2013.

Abstract: The Internet has enabled the creation of a growing number of large-scale knowledge bases in a variety of domains containing complementary information. Tools for automatically aligning these knowledge bases would make it possible to unify many sources of structured knowledge and answer complex queries. However, the efficient alignment of large-scale knowledge bases still poses a considerable challenge. Here, we present Simple Greedy Matching (SiGMa), a simple algorithm for aligning knowledge bases with millions of entities and facts. SiGMa is an iterative propagation algorithm which leverages both the structural information from the relationship graph as well as flexible similarity measures between entity properties in a greedy local search, thus making it scalable. Despite its greedy nature, our experiments indicate that SiGMa can efficiently match some of the world's largest knowledge bases with high precision. We provide additional experiments on benchmark datasets which demonstrate that SiGMa can outperform state-of-the-art approaches both in accuracy and efficiency.

Konstantina Palla, David A. Knowles, and Zoubin Ghahramani. A nonparametric variable clustering model. In Advances in Neural Information Processing Systems 26, pages 1-9, Lake Tahoe, California, USA, December 2012.

Abstract: Factor analysis models effectively summarise the covariance structure of high dimensional data, but the solutions are typically hard to interpret. This motivates attempting to find a disjoint partition, i.e. a simple clustering, of observed variables into highly correlated subsets. We introduce a Bayesian non-parametric approach to this problem, and demonstrate advantages over heuristic methods proposed to date. Our Dirichlet process variable clustering (DPVC) model can discover block-diagonal covariance structures in data. We evaluate our method on both synthetic and gene expression analysis problems.

Konstantina Palla, David A. Knowles, and Zoubin Ghahramani. An infinite latent attribute model for network data. In 29th International Conference on Machine Learning, Edinburgh, Scotland, June 2012.

Abstract: Latent variable models for network data extract a summary of the relational structure underlying an observed network. The simplest possible models subdivide nodes of the network into clusters; the probability of a link between any two nodes then depends only on their cluster assignment. Currently available models can be classified by whether clusters are disjoint or are allowed to overlap. These models can explain a "flat" clustering structure. Hierarchical Bayesian models provide a natural approach to capture more complex dependencies. We propose a model in which objects are characterised by a latent feature vector. Each feature is itself partitioned into disjoint groups (subclusters), corresponding to a second layer of hierarchy. In experimental comparisons, the model achieves significantly improved predictive performance on social and biological link prediction tasks. The results indicate that models with a single layer hierarchy over-simplify real networks.


Novi Quadrianto

Sébastien Bratières, Novi Quadrianto, Sebastian Nowozin, and Zoubin Ghahramani. Scalable Gaussian Process structured prediction for grid factor graph applications. In 31st International Conference on Machine Learning, 2014.

Abstract: Structured prediction is an important and well studied problem with many applications across machine learning. GPstruct is a recently proposed structured prediction model that offers appealing properties such as being kernelised, non-parametric, and supporting Bayesian inference (Bratières et al. 2013). The model places a Gaussian process prior over energy functions which describe relationships between input variables and structured output variables. However, the memory demand of GPstruct is quadratic in the number of latent variables and training runtime scales cubically. This prevents GPstruct from being applied to problems involving grid factor graphs, which are prevalent in computer vision and spatial statistics applications. Here we explore a scalable approach to learning GPstruct models based on ensemble learning, with weak learners (predictors) trained on subsets of the latent variables and bootstrap data, which can easily be distributed. We show experiments with 4M latent variables on image segmentation. Our method outperforms widely-used conditional random field models trained with pseudo-likelihood. Moreover, in image segmentation problems it improves over recent state-of-the-art marginal optimisation methods in terms of predictive performance and uncertainty calibration. Finally, it generalises well on all training set sizes.

Novi Quadrianto, Viktoriia Sharmanska, David A. Knowles, and Zoubin Ghahramani. The supervised ibp: Neighbourhood preserving infinite latent feature models. In 29th Conference on Uncertainty in Artificial Intelligence, Bellevue, USA, July 2013.

Abstract: We propose a probabilistic model to infer supervised latent variables in the Hamming space from observed data. Our model allows simultaneous inference of the number of binary latent variables, and their values. The latent variables preserve neighbourhood structure of the data in a sense that objects in the same semantic concept have similar latent values, and objects in different concepts have dissimilar latent values. We formulate the supervised infinite latent variable problem based on an intuitive principle of pulling objects together if they are of the same type, and pushing them apart if they are not. We then combine this principle with a flexible Indian Buffet Process prior on the latent variables. We show that the inferred supervised latent variables can be directly used to perform a nearest neighbour search for the purpose of retrieval. We introduce a new application of dynamically extending hash codes, and show how to effectively couple the structure of the hash codes with continuously growing structure of the neighbourhood preserving infinite latent feature space.

Sébastien Bratières, Novi Quadrianto, and Zoubin Ghahramani. Bayesian structured prediction using Gaussian processes. arXiv, abs/1307.3846, 2013.

Abstract: We introduce a conceptually novel structured prediction model, GPstruct, which is kernelized, non-parametric and Bayesian, by design. We motivate the model with respect to existing approaches, among others, conditional random fields (CRFs), maximum margin Markov networks (M3N), and structured support vector machines (SVMstruct), which embody only a subset of its properties. We present an inference procedure based on Markov Chain Monte Carlo. The framework can be instantiated for a wide range of structured objects such as linear chains, trees, grids, and other general graphs. As a proof of concept, the model is benchmarked on several natural language processing tasks and a video gesture segmentation task involving a linear chain structure. We show prediction accuracies for GPstruct which are comparable to or exceeding those of CRFs and SVMstruct.

Viktoriia Sharmanska, Novi Quadrianto, and Christoph Lampert. Learning to rank using privileged information. In International Conference on Computer Vision, 2013.

Abstract: Many computer vision problems have an asymmetric distribution of information between training and test time. In this work, we study the case where we are given additional information about the training data, which however will not be available at test time. This situation is called learning using privileged information (LUPI). We introduce two maximum-margin techniques that are able to make use of this additional source of information, and we show that the framework is applicable to several scenarios that have been studied in computer vision before. Experiments with attributes, bounding boxes, image tags and rationales as additional information in object classification show promising results.

Novi Quadrianto, Chao Chen, and Christoph Lampert. The most persistent soft-clique in a set of sampled graphs. In 29th International Conference on Machine Learning, Edinburgh, Scotland, June 2012.

Abstract: When searching for characteristic subpatterns in potentially noisy graph data, it appears self-evident that having multiple observations would be better than having just one. However, it turns out that the inconsistencies introduced when different graph instances have different edge sets pose a serious challenge. In this work we address this challenge for the problem of finding maximum weighted cliques. We introduce the concept of most persistent soft-clique. This is subset of vertices, that 1) is almost fully or at least densely connected, 2) occurs in all or almost all graph instances, and 3) has the maximum weight. We present a measure of clique-ness, that essentially counts the number of edge missing to make a subset of vertices into a clique. With this measure, we show that the problem of finding the most persistent soft-clique problem can be cast either as: a) a max-min two person game optimization problem, or b) a min-min soft margin optimization problem. Both formulations lead to the same solution when using a partial Lagrangian method to solve the optimization problems. By experiments on synthetic data and on real social network data we show that the proposed method is able to reliably find soft cliques in graph data, even if that is distorted by random noise or unreliable observations.

Viktoriia Sharmanska, Novi Quadrianto, and Christoph Lampert. Augmented attributes representations. In 12th European Conference on Computer Vision, pages 242-255, 2012.

Abstract: We propose a new learning method to infer a mid-level feature representation that combines the advantage of semantic attribute representations with the higher expressive power of non-semantic features. The idea lies in augmenting an existing attribute-based representation with additional dimensions for which an autoencoder model is coupled with a large-margin principle. This construction allows a smooth transition between the zero-shot regime with no training example, the unsupervised regime with training examples but without class labels, and the supervised regime with training examples and with class labels. The resulting optimization problem can be solved efficiently, because several of the necessity steps have closed-form solutions. Through extensive experiments we show that the augmented representation achieves better results in terms of object categorization accuracy than the semantic representation alone.

Tatiana Tommasi, Novi Quadrianto, Barbara Caputo, and Christoph Lampert. Beyond dataset bias: Multi-task unaligned shared knowledge transfer. In 11th Asian Conference on Computer Vision, 2012.

Abstract: Many visual datasets are traditionally used to analyze the performance of different learning techniques. The evaluation is usually done within each dataset, therefore it is questionable if such results are a reliable indicator of true generalization ability. We propose here an algorithm to exploit the existing data resources when learning on a new multiclass problem. Our main idea is to identify an image representation that decomposes orthogonally into two subspaces: a part specific to each dataset, and a part generic to, and therefore shared between, all the considered source sets. This allows us to use the generic representation as un-biased reference knowledge for a novel classification task. By casting the method in the multi-view setting, we also make it possible to use different features for different databases. We call the algorithm MUST, Multitask Unaligned Shared knowledge Transfer. Through extensive experiments on five public datasets, we show that MUST consistently improves the cross-datasets generalization performance.


Carl Edward Rasmussen

Roger Frigola, Fredrik Lindsten, Thomas B. Schön, and Carl Edward Rasmussen. Identification of Gaussian process state-space models with particle stochastic approximation EM. In Proceedings of the 19th World Congress of the International Federation of Automatic Control (IFAC), 2014.

Abstract: Gaussian process state-space models (GP-SSMs) are a very flexible family of models of nonlinear dynamical systems. They comprise a Bayesian nonparametric representation of the dynamics of the system and additional (hyper-)parameters governing the properties of this nonparametric representation. The Bayesian formalism enables systematic reasoning about the uncertainty in the system dynamics. We present an approach to maximum likelihood identification of the parameters in GP-SSMs, while retaining the full nonparametric description of the dynamics. The method is based on a stochastic approximation version of the EM algorithm that employs recent developments in particle Markov chain Monte Carlo for efficient identification.

Marc Peter Deisenroth, Dieter Fox, and Carl Edward Rasmussen. Gaussian processes for data-efficient learning in robotics and control. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, doi 10.1109/TPAMI.2013.218.

Abstract: Autonomous learning has been a promising direction in control and robotics for more than a decade since data-driven learning allows to reduce the amount of engineering knowledge, which is otherwise required. However, autonomous reinforcement learning (RL) approaches typically require many interactions with the system to learn controllers, which is a practical limitation in real systems, such as robots, where many interactions can be impractical and time consuming. To address this problem, current learning approaches typically require task-specific knowledge in form of expert demonstrations, realistic simulators, pre-shaped policies, or specific knowledge about the underlying dynamics. In this article, we follow a different approach and speed up learning by extracting more information from data. In particular, we learn a probabilistic, non-parametric Gaussian process transition model of the system. By explicitly incorporating model uncertainty into long-term planning and controller learning our approach reduces the effects of model errors, a key problem in model-based learning. Compared to state-of-the art RL our model-based policy search method achieves an unprecedented speed of learning. We demonstrate its applicability to autonomous learning in real robot and control tasks.

Comment: accepted

Roger Frigola, Fredrik Lindsten, Thomas B. Schön, and Carl Edward Rasmussen. Bayesian inference and learning in Gaussian process state-space models with particle MCMC. In L. Bottou, C.J.C. Burges, Z. Ghahramani, M. Welling, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 26. Curran Associates, Inc., 2013.

Abstract: State-space models are successfully used in many areas of science, engineering and economics to model time series and dynamical systems. We present a fully Bayesian approach to inference and learning in nonlinear nonparametric state-space models. We place a Gaussian process prior over the transition dynamics, resulting in a flexible model able to capture complex dynamical phenomena. However, to enable efficient inference, we marginalize over the dynamics of the model and instead infer directly the joint smoothing distribution through the use of specially tailored Particle Markov Chain Monte Carlo samplers. Once a sample from the smoothing distribution is computed, the state transition predictive distribution can be formulated analytically. We make use of sparse Gaussian process models to greatly reduce the computational complexity of the approach.

Roger Frigola and Carl Edward Rasmussen. Integrated pre-processing for Bayesian nonlinear system identification with Gaussian processes. In Decision and Control (CDC), 2013 IEEE 52nd Annual Conference on, 2013.

Abstract: We introduce GP-FNARX: a new model for nonlinear system identification based on a nonlinear autoregressive exogenous model (NARX) with filtered regressors (F) where the nonlinear regression problem is tackled using sparse Gaussian processes (GP). We integrate data pre-processing with system identification into a fully automated procedure that goes from raw data to an identified model. Both pre-processing parameters and GP hyper-parameters are tuned by maximizing the marginal likelihood of the probabilistic model. We obtain a Bayesian model of the system's dynamics which is able to report its uncertainty in regions where the data is scarce. The automated approach, the modeling of uncertainty and its relatively low computational cost make of GP-FNARX a good candidate for applications in robotics and adaptive control.

Andrew McHutchon and Carl Edward Rasmussen. Gaussian process training with input noise. In Advances in Neural Information Processing Systems 24, Granada, Spain, December 2012.

Abstract: In standard Gaussian Process regression input locations are assumed to be noise free. We present a simple yet effective GP model for training on input points corrupted by i.i.d. Gaussian noise. To make computations tractable we use a local linear expansion about each input point. This allows the input noise to be recast as output noise proportional to the squared gradient of the GP posterior mean. The input noise hyperparameters are trained alongside other hyperparameters by the usual method of maximisation of the marginal likelihood, and allow estimation of the noise levels on each input dimension. Training uses an iterative scheme, which alternates between optimising the hyperparameters and calculating the posterior gradient. Analytic predictive moments can then be found for Gaussian distributed test points. We compare our model to others over a range of different regression problems and show that it improves over current methods.

Michael A. Osborne, David Duvenaud, Roman Garnett, Carl Edward Rasmussen, Stephen J. Roberts, and Zoubin Ghahramani. Active learning of model evidence using Bayesian quadrature. In Advances in Neural Information Processing Systems 25, pages 46-54, Lake Tahoe, California, USA, December 2012.

Abstract: Numerical integration is a key component of many problems in scientific computing, statistical modelling, and machine learning. Bayesian Quadrature is a model-based method for numerical integration which, relative to standard Monte Carlo methods, offers increased sample efficiency and a more robust estimate of the uncertainty in the estimated integral. We propose a novel Bayesian Quadrature approach for numerical integration when the integrand is non-negative, such as the case of computing the marginal likelihood, predictive distribution, or normalising constant of a probabilistic model. Our approach approximately marginalises the quadrature model’s hyperparameters in closed form, and introduces an active learning scheme to optimally select function evaluations, as opposed to using Monte Carlo samples. We demonstrate our method on both a number of synthetic benchmarks and a real scientific problem from astronomy.

John P. Cunningham, Zoubin Ghahramani, and Carl E. Rasmussen. Gaussian Processes for time-marked time-series data. In 15th International Conference on Artificial Intelligence and Statistics, 2012.

Abstract: In many settings, data is collected as multiple time series, where each recorded time series is an observation of some underlying dynamical process of interest. These observations are often time-marked with known event times, and one desires to do a range of standard analyses. When there is only one time marker, one simply aligns the observations temporally on that marker. When multiple time-markers are present and are at different times on different time series observations, these analyses are more difficult. We describe a Gaussian Process model for analyzing multiple time series with multiple time markings, and we test it on a variety of data.

Marc Peter Deisenroth, Ryan D. Turner, Marco F. Huber, Uwe D. Hanebeck, and Carl Edward Rasmussen. Robust filtering and smoothing with Gaussian processes. IEEE Transactions on Automatic Control, 57(7):1865-1871, 2012.

Abstract: We propose a principled algorithm for robust Bayesian filtering and smoothing in nonlinear stochastic dynamic systems when both the transition function and the measurement function are described by nonparametric Gaussian process (GP) models. GPs are gaining increasing importance in signal processing, machine learning, robotics, and control for representing unknown system functions by posterior probability distributions. This modern way of "system identification" is more robust than finding point estimates of a parametric function representation. Our principled filtering/smoothing approach for GP dynamic systems is based on analytic moment matching in the context of the forward-backward algorithm. Our numerical evaluations demonstrate the robustness of the proposed approach in situations where other state-of-the-art Gaussian filters and smoothers can fail.

Joseph Hall, Carl Edward Rasmussen, and Jan Maciejowski. Modelling and control of nonlinear systems using Gaussian processes with partial model information. In 51st IEEE Conference on Decision and Control, 2012.

Abstract: Gaussian processes are gaining increasing popularity among the control community, in particular for the modelling of discrete time state space systems. However, it has not been clear how to incorporate model information, in the form of known state relationships, when using a Gaussian process as a predictive model. An obvious example of known prior information is position and velocity related states. Incorporation of such information would be beneficial both computationally and for faster dynamics learning. This paper introduces a method of achieving this, yielding faster dynamics learning and a reduction in computational effort from O(Dn2) to O((D-F)n2) in the prediction stage for a system with D states, F known state relationships and n observations. The effectiveness of the method is demonstrated through its inclusion in the PILCO learning algorithm with application to the swing-up and balance of a torque-limited pendulum and the balancing of a robotic unicycle in simulation.

Ryan D. Turner and Carl Edward Rasmussen. Model based learning of sigma points in unscented Kalman filtering. Neurocomputing, 80:47-53, 2012, doi 10.1016/j.neucom.2011.07.029.

Abstract: The unscented Kalman filter (UKF) is a widely used method in control and time series applications. The UKF suffers from arbitrary parameters necessary for sigma point placement, potentially causing it to perform poorly in nonlinear problems. We show how to treat sigma point placement in a UKF as a learning problem in a model based view. We demonstrate that learning to place the sigma points correctly from data can make sigma point collapse much less likely. Learning can result in a significant increase in predictive performance over default settings of the parameters in the UKF and other filters designed to avoid the problems of the UKF, such as the GP-ADF. At the same time, we maintain a lower computational complexity than the other methods. We call our method UKF-L.

David Duvenaud, Hannes Nickisch, and Carl Edward Rasmussen. Additive Gaussian processes. In Advances in Neural Information Processing Systems 24, pages 226-234, Granada, Spain, December 2011.

Abstract: We introduce a Gaussian process model of functions which are additive. An additive function is one which decomposes into a sum of low-dimensional functions, each depending on only a subset of the input variables. Additive GPs generalize both Generalized Additive Models, and the standard GP models which use squared-exponential kernels. Hyperparameter learning in this model can be seen as Bayesian Hierarchical Kernel Learning (HKL). We introduce an expressive but tractable parameterization of the kernel function, which allows efficient evaluation of all input interaction terms, whose number is exponential in the input dimension. The additional structure discoverable by this model results in increased interpretability, as well as state-of-the-art predictive power in regression tasks.

Marc Peter Deisenroth, Carl Edward Rasmussen, and Dieter Fox. Learning to control a low-cost manipulator using data-efficient reinforcement learning. In 9th International Conference on Robotics: Science & Systems, Los Angeles, CA, USA, June 2011.

Abstract: Over the last years, there has been substantial progress in robust manipulation in unstructured environments. The long-term goal of our work is to get away from precise, but very expensive robotic systems and to develop affordable, potentially imprecise, self-adaptive manipulator systems that can interactively perform tasks such as playing with children. In this paper, we demonstrate how a low-cost off-the-shelf robotic system can learn closed-loop policies for a stacking task in only a handful of trials - from scratch. Our manipulator is inaccurate and provides no pose feedback. For learning a controller in the work space of a Kinect-style depth camera, we use a model-based reinforcement learning technique. Our learning method is data efficient, reduces model bias, and deals with several noise sources in a principled way during long-term planning. We present a way of incorporating state-space constraints into the learning process and analyze the learning gain by exploiting the sequential structure of the stacking task.

Comment: project site

Marc Peter Deisenroth and Carl Edward Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In 28th International Conference on Machine Learning, 2011.

Abstract: In this paper, we introduce PILCO, a practical, data-efficient model-based policy search method. PILCO reduces model bias, one of the key problems of model-based reinforcement learning, in a principled way. By learning a probabilistic dynamics model and explicitly incorporating model uncertainty into long-term planning, PILCO can cope with very little data and facilitates learning from scratch in only a few trials. Policy evaluation is performed in closed form using state-of-the-art approximate inference. Furthermore, policy gradients are computed analytically for policy improvement. We report unprecedented learning efficiency on challenging and high-dimensional control tasks.

Comment: web site

Joseph Hall, Carl Edward Rasmussen, and Jan Maciejowski. Reinforcement learning with reference tracking control in continuous state spaces. In Proceedings of 50th IEEE Conference on Decision and Control and European Control Conference, 2011.

Abstract: The contribution described in this paper is an algorithm for learning nonlinear, reference tracking, control policies given no prior knowledge of the dynamical system and limited interaction with the system through the learning process. Concepts from the field of reinforcement learning, Bayesian statistics and classical control have been brought together in the formulation of this algorithm which can be viewed as a form indirect self tuning regulator. On the task of reference tracking using the inverted pendulum it was shown to yield generally improved performance on the best controller derived from the standard linear quadratic method using only 30 s of total interaction with the system. Finally, the algorithm was shown to work on the double pendulum proving its ability to solve nontrivial control tasks.

Carl Edward Rasmussen and Hannes Nickisch. Gaussian Processes for Machine Learning (GPML) Toolbox. Journal of Machine Learning Research, 11:3011-3015, December 2010.

Abstract: The GPML toolbox provides a wide range of functionality for Gaussian process (GP) inference and prediction. GPs are specified by mean and covariance functions; we offer a library of simple mean and covariance functions and mechanisms to compose more complex ones. Several likelihood functions are supported including Gaussian and heavy-tailed for regression as well as others suitable for classification. Finally, a range of inference methods is provided, including exact and variational inference, Expectation Propagation, and Laplace's method dealing with non-Gaussian likelihoods and FITC for dealing with large regression tasks.

Comment: Toolbox avaiable from here. Implements algorithms from Rasmussen and Williams, 2006.

Hannes Nickisch and Carl Edward Rasmussen. Gaussian mixture modeling with Gaussian process latent variable models. In Proceedings of the 32nd DAGM Symposium on Pattern Recognition, Lecture Notes in Computer Science (LNCS), Darmstadt, Germany, September 2010. Springer, doi 10.1007/978-3-642-15986-2_28.

Abstract: Density modeling is notoriously difficult for high dimensional data. One approach to the problem is to search for a lower dimensional manifold which captures the main characteristics of the data. Recently, the Gaussian Process Latent Variable Model (GPLVM) has successfully been used to find low dimensional manifolds in a variety of complex data. The GPLVM consists of a set of points in a low dimensional latent space, and a stochastic map to the observed space. We show how it can be interpreted as a density model in the observed space. However, the GPLVM is not trained as a density model and therefore yields bad density estimates. We propose a new training strategy and obtain improved generalisation performance and better density estimates in comparative evaluations on several benchmark data sets.

Ryan Turner and Carl Edward Rasmussen. Model based learning of sigma points in unscented Kalman filtering. In Samuel Kaski, David J. Miller, Erkki Oja, and Antti Honkela, editors, Machine Learning for Signal Processing (MLSP 2010), pages 178-183, Kittilä, Finland, August 2010.

Abstract: The unscented Kalman filter (UKF) is a widely used method in control and time series applications. The UKF suffers from arbitrary parameters necessary for a step known as sigma point placement, causing it to perform poorly in nonlinear problems. We show how to treat sigma point placement in a UKF as a learning problem in a model based view. We demonstrate that learning to place the sigma points correctly from data can make sigma point collapse much less likely. Learning can result in a significant increase in predictive performance over default settings of the parameters in the UKF and other filters designed to avoid the problems of the UKF, such as the GP-ADF. At the same time, we maintain a lower computational complexity than the other methods. We call our method UKF-L.

Dilan Görür and Carl Edward Rasmussen. Dirichlet process Gaussian mixture models: Choice of the base distribution. Journal of Computer Science and Technology, 25(4):615-625, July 2010, doi 10.1007/s11390-010-9355-8.

Abstract: In the Bayesian mixture modeling framework it is possible to infer the necessary number of components to model the data and therefore it is unnecessary to explicitly restrict the number of components. Nonparametric mixture models sidestep the problem of finding the "correct" number of mixture components by assuming infinitely many components. In this paper Dirichlet process mixture (DPM) models are cast as infinite mixture models and inference using Markov chain Monte Carlo is described. The specification of the priors on the model parameters is often guided by mathematical and practical convenience. The primary goal of this paper is to compare the choice of conjugate and non-conjugate base distributions on a particular class of DPM models which is widely used in applications, the Dirichlet process Gaussian mixture model (DPGMM). We compare computational efficiency and modeling performance of DPGMM defined using a conjugate and a conditionally conjugate base distribution. We show that better density models can result from using a wider class of priors with no or only a modest increase in computational effort.

Miguel Lázaro-Gredilla, Joaquin Quiñonero-Candela, Carl Edward Rasmussen, and Aníbal Figueiras-Vidal. Sparse spectrum Gaussian process regression. Journal of Machine Learning Research, 11:1865-1881, June 2010.

Abstract: We present a new sparse Gaussian Process (GP) model for regression. The key novel idea is to sparsify the spectral representation of the GP. This leads to a simple, practical algorithm for regression tasks. We compare the achievable trade-offs between predictive accuracy and computational requirements, and show that these are typically superior to existing state-of-the-art sparse approximations. We discuss both the weight space and function space representations, and note that the new construction implies priors over functions which are always stationary, and can approximate any covariance function in this class.

Yunus Saatçi, Ryan Turner, and Carl Edward Rasmussen. Gaussian process change point models. In 27th International Conference on Machine Learning, pages 927-934, Haifa, Israel, June 2010.

Abstract: We combine Bayesian online change point detection with Gaussian processes to create a nonparametric time series model which can handle change points. The model can be used to locate change points in an online manner; and, unlike other Bayesian online change point detection algorithms, is applicable when temporal correlations in a regime are expected. We show three variations on how to apply Gaussian processes in the change point context, each with their own advantages. We present methods to reduce the computational burden of these models and demonstrate it on several real world data sets.

Comment: poster, slides.

Ryan Turner, Marc Peter Deisenroth, and Carl Edward Rasmussen. State-space inference and learning with Gaussian processes. In Yee Whye Teh and Mike Titterington, editors, 13th International Conference on Artificial Intelligence and Statistics, volume 9 of W & CP, pages 868-875, Chia Laguna, Sardinia, Italy, May 13-15 2010. Journal of Machine Learning Research.

Abstract: State-space inference and learning with Gaussian processes (GPs) is an unsolved problem. We propose a new, general methodology for inference and learning in nonlinear state-space models that are described probabilistically by non-parametric GP models. We apply the expectation maximization algorithm to iterate between inference in the latent state-space and learning the parameters of the underlying GP dynamics model.

Comment: poster.

Ryan Turner, Marc Peter Deisenroth, and Carl Edward Rasmussen. System identification in Gaussian process dynamical systems. In Dilan Görür, editor, NIPS Workshop on Nonparametric Bayes, Whistler, BC, Canada, December 2009.

Comment: poster.

Ryan Turner, Yunus Saatçi, and Carl Edward Rasmussen. Adaptive sequential Bayesian change point detection. In Zaïd Harchaoui, editor, NIPS Workshop on Temporal Segmentation, Whistler, BC, Canada, December 2009.

Abstract: Real-world time series are often nonstationary with respect to the parameters of some underlying prediction model (UPM). Furthermore, it is often desirable to adapt the UPM to incoming regime changes as soon as possible, necessitating sequential inference about change point locations. A Bayesian algorithm for online change point detection (BOCPD) has been introduced recently by Adams and MacKay (2007). In this algorithm, uncertainty about the last change point location is updated sequentially, and is integrated out to make online predictions robust to parameter changes. BOCPD requires a set of fixed hyper-parameters which allow the user to fully specify the hazard function for change points and the prior distribution over the parameters of the UPM. In practice, finding the "right" hyper-parameters can be quite difficult. We therefore extend BOCPD by introducing hyper-parameter learning, without sacrificing the online nature of the algorithm. Hyper-parameter learning is performed by optimizing the marginal likelihood of the BOCPD model, a closed-form quantity which can be computed sequentially. We illustrate performance on three real-world datasets.

Comment: poster, slides.

Marc Peter Deisenroth and Carl Edward Rasmussen. Efficient reinforcement learning for motor control. In 10th International PhD Workshop on Systems and Control, Hluboká nad Vltavou, Czech Republic, September 2009.

Abstract: Artificial learners often require many more trials than humans or animals when learning motor control tasks in the absence of expert knowledge. We implement two key ingredients of biological learning systems, generalization and incorporation of uncertainty into the decision-making process, to speed up artificial learning. We present a coherent and fully Bayesian framework that allows for efficient artificial learning in the absence of expert knowledge. The success of our learning framework is demonstrated on challenging nonlinear control problems in simulation and in hardware.

Marc Peter Deisenroth and Carl Edward Rasmussen. Bayesian inference for efficient learning in control. In Multidisciplinary Symposium on Reinforcement Learning, Montréal, QC, Canada, June 2009.

Abstract: In contrast to humans or animals, artificial learners often require more trials when learning motor control tasks solely based on experience. Efficient autonomous learners will reduce the amount of engineering required to solve control problems. By using probabilistic forward models, we can employ two key ingredients of biological learning systems to speed up artificial learning. We present a consistent and coherent Bayesian framework that allows for efficient autonomous experience-based learning. We demonstrate the success of our learning algorithm by applying it to challenging nonlinear control problems in simulation and in hardware.

Marc Peter Deisenroth, Carl Edward Rasmussen, and Jan Peters. Gaussian process dynamic programming. Neurocomputing, 72(7-9):1508-1524, March 2009, doi 10.1016/j.neucom.2008.12.019.

Abstract: Reinforcement learning (RL) and optimal control of systems with continuous states and actions require approximation techniques in most interesting cases. In this article, we introduce Gaussian process dynamic programming (GPDP), an approximate value function-based RL algorithm. We consider both a classic optimal control problem, where problem-specific prior knowledge is available, and a classic RL problem, where only very general priors can be used. For the classic optimal control problem, GPDP models the unknown value functions with Gaussian processes and generalizes dynamic programming to continuous-valued states and actions. For the RL problem, GPDP starts from a given initial state and explores the state space using Bayesian active learning. To design a fast learner, available data have to be used efficiently. Hence, we propose to learn probabilistic models of the a priori unknown transition dynamics and the value functions on the fly. In both cases, we successfully apply the resulting continuous-valued controllers to the under-actuated pendulum swing up and analyze the performances of the suggested algorithms. It turns out that GPDP uses data very efficiently and can be applied to problems, where classic dynamic programming would be cumbersome.

Comment: code.

Carl Edward Rasmussen, Bernhard J. de la Cruz, Zoubin Ghahramani, and David L. Wild. Modeling and visualizing uncertainty in gene expression clusters using Dirichlet process mixtures. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 6(4):615-628, 2009, doi 10.1109/TCBB.2007.70269.

Abstract: Although the use of clustering methods has rapidly become one of the standard computational approaches in the literature of microarray gene expression data, little attention has been paid to uncertainty in the results obtained. Dirichlet process mixture (DPM) models provide a nonparametric Bayesian alternative to the bootstrap approach to modeling uncertainty in gene expression clustering. Most previously published applications of Bayesian model-based clustering methods have been to short time series data. In this paper, we present a case study of the application of nonparametric Bayesian clustering methods to the clustering of high-dimensional nontime series gene expression data using full Gaussian covariances. We use the probability that two genes belong to the same cluster in a DPM model as a measure of the similarity of these gene expression profiles. Conversely, this probability can be used to define a dissimilarity measure, which, for the purposes of visualization, can be input to one of the standard linkage algorithms used for hierarchical clustering. Biologically plausible results are obtained from the Rosetta compendium of expression profiles which extend previously published cluster analyses of this data.

Carl Edward Rasmussen and Marc Peter Deisenroth. Probabilistic inference for fast learning in control. In S. Girgin, M. Loth, R. Munos, P. Preux, and D. Ryabko, editors, Recent Advances in Reinforcement Learning, volume 5323 of Lecture Notes in Computer Science (LNCS), pages 229-242, Villeneuve d'Ascq, France, November 2008. Springer-Verlag.

Abstract: We provide a novel framework for very fast model-based reinforcement learning in continuous state and action spaces. The framework requires probabilistic models that explicitly characterize their levels of confidence. Within this framework, we use flexible, non-parametric models to describe the world based on previously collected experience. We demonstrate learning on the cart-pole problem in a setting where we provide very limited prior knowledge about the task. Learning progresses rapidly, and a good policy is found after only a hand-full of iterations.

Comment: videos and more. slides.

Hannes Nickisch and Carl Edward Rasmussen. Approximations for binary Gaussian process classification. Journal of Machine Learning Research, 9:2035-2078, October 2008.

Abstract: We provide a comprehensive overview of many recent algorithms for approximate inference in Gaussian process models for probabilistic binary classification. The relationships between several approaches are elucidated theoretically, and the properties of the different algorithms are corroborated by experimental results. We examine both 1) the quality of the predictive distributions and 2) the suitability of the different marginal likelihood approximations for model selection (selecting hyperparameters) and compare to a gold standard based on MCMC. Interestingly, some methods produce good predictive distributions although their marginal likelihood approximations are poor. Strong conclusions are drawn about the methods: The Expectation Propagation algorithm is almost always the method of choice unless the computational budget is very tight. We also extend existing methods in various ways, and provide unifying code implementing all approaches.

Marc Peter Deisenroth, Jan Peters, and Carl Edward Rasmussen. Approximate dynamic programming with Gaussian processes. In 2008 American Control Conference (ACC 2008), pages 4480-4485, Seattle, WA, USA, June 2008.

Abstract: In general, it is difficult to determine an optimal closed-loop policy in nonlinear control problems with continuous-valued state and control domains. Hence, approximations are often inevitable. The standard method of discretizing states and controls suffers from the curse of dimensionality and strongly depends on the chosen temporal sampling rate. The paper introduces Gaussian Process Dynamic Programming (GPDP). In GPDP, value functions in the Bellman recursion of the dynamic programming algorithm are modeled using Gaussian processes. GPDP returns an optimal state-feedback for a finite set of states. Based on these outcomes, we learn a possibly discontinuous closed-loop policy on the entire state space by switching between two independently trained Gaussian processes.

Comment: code.

Marc Peter Deisenroth, Carl Edward Rasmussen, and Jan Peters. Model-based reinforcement learning with continuous states and actions. In Proceedings of the 16th European Symposium on Artificial Neural Networks (ESANN 2008), pages 19-24, Bruges, Belgium, April 2008.

Abstract: Finding an optimal policy in a reinforcement learning (RL) framework with continuous state and action spaces is challenging. Approximate solutions are often inevitable. GPDP is an approximate dynamic programming algorithm based on Gaussian process (GP) models for the value functions. In this paper, we extend GPDP to the case of unknown transition dynamics. After building a GP model for the transition dynamics, we apply GPDP to this model and determine a continuous-valued policy in the entire state space. We apply the resulting controller to the underpowered pendulum swing up. Moreover, we compare our results on this RL task to a nearly optimal discrete DP solution in a fully known environment.

Comment: code. slides

Sören Sonnenburg, Mikio L. Braun, Cheng Soon Ong, Samy Bengio, Leon Bottou, Geoffrey Holmes, Yann LeCun, Klaus-Robert Müller, Fernando Pereira, Carl Edward Rasmussen, Gunnar Rätsch, Bernhard Schölkopf, Alexander Smola, Pascal Vincent, Jason Weston, and Robert C. Williamson. The need for open source software in machine learning. Journal of Machine Learning Research, 8:2443-2466, October 2007.

Abstract: Open source tools have recently reached a level of maturity which makes them suitable for building large-scale real-world systems. At the same time, the field of machine learning has developed a large body of powerful learning algorithms for diverse applications. However, the true potential of these methods is not realized, since existing implementations are not openly shared, resulting in software with low usability, and weak interoperability. We argue that this situation can be significantly improved by increasing incentives for researchers to publish their software under an open source model. Additionally, we outline the problems authors are faced with when trying to publish algorithmic implementations of machine learning methods. We believe that a resource of peer reviewed software accompanied by short articles would be highly valuable to both the machine learning and the general scientific community.

Joaquin Quiñonero-Candela, Carl Edward Rasmussen, and Christopher K. I. Williams. Approximation methods for Gaussian process regression. In L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, editors, Large-Scale Kernel Machines, Neural Information Processing, pages 203-223. The MIT Press, Cambridge, MA, USA, September 2007.

Abstract: A wealth of computationally efficient approximation methods for Gaussian process regression have been recently proposed. We give a unifying overview of sparse approximations, following Quiñonero-Candela and Rasmussen (2005), and a brief review of approximate matrix-vector multiplication methods.

Comment: book

Dilan Görür, Frank Jäkel, and Carl Edward Rasmussen. A choice model with infinitely many latent features. In W. W. Cohen and Andrew Moore, editors, 23rd International Conference on Machine Learning, pages 361-368, New York, NY, USA, June 2006. ACM Press, doi 10.1145/1143844.1143890.

Abstract: Elimination by aspects (EBA) is a probabilistic choice model describing how humans decide between several options. The options from which the choice is made are characterized by binary features and associated weights. For instance, when choosing which mobile phone to buy the features to consider may be: long lasting battery, color screen, etc. Existing methods for inferring the parameters of the model assume pre-specified features. However, the features that lead to the observed choices are not always known. Here, we present a non-parametric Bayesian model to infer the features of the options and the corresponding weights from choice data. We use the Indian buffet process (IBP) as a prior over the features. Inference using Markov chain Monte Carlo (MCMC) in conjugate IBP models has been previously described. The main contribution of this paper is an MCMC algorithm for the EBA model that can also be used in inference for other non-conjugate IBP models-this may broaden the use of IBP priors considerably.

Malte Kuß and Carl Edward Rasmussen. Assessing approximations for Gaussian process classification. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, pages 699-706, Cambridge, MA, USA, April 2006. The MIT Press.

Abstract: Gaussian processes are attractive models for probabilistic classification but unfortunately exact inference is analytically intractable. We compare Laplace's method and Expectation Propagation (EP) focusing on marginal likelihood estimates and predictive performance. We explain theoretically and corroborate empirically that EP is superior to Laplace. We also compare to a sophisticated MCMC scheme and show that EP is surprisingly accurate.

Tobias Pfingsten, Daniel Herrmann, and Carl Edward Rasmussen. Model-based design analysis and yield optimization. IEEE Transactions on Semiconductor Manufacturing, 19(4):475-486, 2006, doi 10.1109/TSM.2006.883589.

Abstract: Fluctuations are inherent to any fabrication process. Integrated circuits and microelectromechanical systems are particularly affected by these variations, and due to high-quality requirements the effect on the devices' perform ance has to be understood quantitatively. In recent years, it has become possible to model the performance of such complex systems on the basis of design specifications, and model-based sensitivity analysis has made its way into industrial engineering. We show how an efficient Bayesian approach, using a Gaussian process prior, can replace the commonly used brute-force Monte Carlo scheme, making it possible to apply the analysis to computationally costly models. We introduce a number of global, statistically justified sensitivity measures for design analysis and optimization. Two models of integrated systems serve us as case studies to introduce the analysis and to assess its convergence properties. We show that the Bayesian Monte Carlo scheme can save costly simulation runs and can ensure a reliable accuracy of the analysis.

Comment: Winner of the 2006 Best Paper Award for the journal.

Joaquin Quiñonero-Candela, Carl Edward Rasmussen, Fabian Sinz, Olivier Bousquet, and Bernhard Schölkopf. Evaluating predictive uncertainty challenge. In J. Quiñonero-Candela, I. Dagan, B. Magnini, and F. d'Alché-Buc, editors, Machine Learning Challenges. Evaluating predictive uncertainty, visual object classification and recognising tectual entailment. First PASCAL Machine Learning Challenges Workshop, volume 3944 of Lecture Notes in Computer Science (LNCS), pages 1-27, Berlin, Germany, 04 2006. Springer, doi 10.1007/11736790_1.

Abstract: This Chapter presents the PASCAL1 Evaluating Predictive Uncertainty Challenge, introduces the contributed Chapters by the participants who obtained outstanding results, and provides a discussion with some lessons to be learnt. The Challenge was set up to evaluate the ability of Machine Learning algorithms to provide good "probabilistic predictions", rather than just the usual "point predictions" with no measure of uncertainty, in regression and classification problems. Participants had to compete on a number of regression and classification tasks, and were evaluated by both traditional losses that only take into account point predictions and losses we proposed that evaluate the quality of the probabilistic predictions.

Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian processes for machine learning. The MIT Press, 2006.

Abstract: Gaussian processes (GPs) provide a principled, practical, probabilistic approach to learning in kernel machines. GPs have received increased attention in the machine-learning community over the past decade, and this book provides a long-needed systematic and unified treatment of theoretical and practical aspects of GPs in machine learning. The treatment is comprehensive and self-contained, targeted at researchers and students in machine learning and applied statistics.

Comment: Winner of the 2009 DeGroot Prize. Book web page, chapters and entire book pdf. GPML Toolbox.

Malte Kuß, Tobias Pfingsten, Lehel Csatò, and Carl Edward Rasmussen. Approximate inference for robust Gaussian process regression. Technical Report 136, Max Planck Institute for Biological Cybernetics, Tübingen, Germany, 2005.

Abstract: Gaussian process (GP) priors have been successfully used in non-parametric Bayesian regression and classification models. Inference can be performed analytically only for the regression model with Gaussian noise. For all other likelihood models inference is intractable and various approximation techniques have been proposed. In recent years expectation-propagation (EP) has been developed as a general method for approximate inference. This article provides a general summary of how expectation-propagation can be used for approximate inference in Gaussian process models. Furthermore we present a case study describing its implementation for a new robust variant of Gaussian process regression. To gain further insights into the quality of the EP approximation we present experiments in which we compare to results obtained by Markov chain Monte Carlo (MCMC) sampling.

Malte Kuß and Carl Edward Rasmussen. Assessing approximate inference for binary Gaussian process classification. Journal of Machine Learning Research, 6:1679-1704, 2005.

Abstract: Gaussian process priors can be used to define flexible, probabilistic classification models. Unfortunately exact Bayesian inference is analytically intractable and various approximation techniques have been proposed. In this work we review and compare Laplace's method and Expectation Propagation for approximate Bayesian inference in the binary Gaussian process classification model. We present a comprehensive comparison of the approximations, their predictive performance and marginal likelihood estimates to results obtained by MCMC sampling. We explain theoretically and corroborate empirically the advantages of Expectation Propagation compared to Laplace's method.

Joaquin Quiñonero-Candela and Carl Edward Rasmussen. Analysis of some methods for reduced rank Gaussian process regression. In R. Murray-Smith and R. Shorten, editors, Switching and Learning in Feedback Systems, pages 98-127. Springer, Berlin, Heidelberg, 2005.

Abstract: While there is strong motivation for using Gaussian Processes (GPs) due to their excellent performance in regression and classification problems, their computational complexity makes them impractical when the size of the training set exceeds a few thousand cases. This has motivated the recent proliferation of a number of cost-effective approximations to GPs, both for classification and for regression. In this paper we analyze one popular approximation to GPs for regression: the reduced rank approximation. While generally GPs are equivalent to infinite linear models, we show that Reduced Rank Gaussian Processes (RRGPs) are equivalent to finite sparse linear models. We also introduce the concept of degenerate GPs and show that they correspond to inappropriate priors. We show how to modify the RRGP to prevent it from being degenerate at test time. Training RRGPs consists both in learning the covariance function hyperparameters and the support set. We propose a method for learning hyperparameters for a given support set. We also review the Sparse Greedy GP (SGGP) approximation (Somla and Bartlett, 2001), which is a way of learning the support set for given hyperparameters based on approximating the posterior. We propose an alternative method to the SGGP that has better generalization capabilities. Finally we make experiments to compare the different ways of training a RRGP. We provide some Matlab code for learning RRGPs.

Joaquin Quiñonero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939-1959, 2005.

Abstract: We provide a new unifying view, including all existing proper probabilistic sparse approximations for Gaussian process regression. Our approach relies on expressing the effective prior which the methods are using. This allows new insights to be gained, and highlights the relationship between existing methods. It also allows for a clear theoretically justified ranking of the closeness of the known approximations to the corresponding full GPs. Finally we point directly to designs of new better sparse approximations, combining the best of the existing strategies, within attractive computational constraints.

Carl Edward Rasmussen and Joaquin Quiñonero-Candela. Healing the Relevance Vector Machine through augmentation. In L. De Raedt and S. Wrobel, editors, 22nd International Conference on Machine Learning, pages 689-696, 2005.

Abstract: The Relevance Vector Machine (RVM) is a sparse approximate Bayesian kernel method. It provides full predictive distributions for test cases. However, the predictive uncertainties have the unintuitive property, that they get smaller the further you move away from the training cases. We give a thorough analysis. Inspired by the analogy to non-degenerate Gaussian Processes, we suggest augmentation to solve the problem. The purpose of the resulting model, RVM*, is primarily to corroborate the theoretical and experimental analysis. Although RVM* could be used in practical applications, it is no longer a truly sparse model. Experiments show that sparsity comes at the expense of worse predictive distributions.

Carl Edward Rasmussen and Malte Kuß. Gaussian processes in reinforcement learning. In S. Thrun, L.K. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16, pages 751-759, Cambridge, MA, USA, December 2004. The MIT Press.

Abstract: We exploit some useful properties of Gaussian process (GP) regression models for reinforcement learning in continuous state spaces and discrete time. We demonstrate how the GP model allows evaluation of the value function in closed form. The resulting policy iteration algorithm is demonstrated on a simple problem with a two dimensional state space. Further, we speculate that the intrinsic ability of GP models to characterise distributions of functions would allow the method to capture entire distributions over future values instead of merely their expectation, which has traditionally been the focus of much of reinforcement learning.

Edward Snelson, Carl Edward Rasmussen, and Zoubin Ghahramani. Warped Gaussian processes. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16, pages 337-344, Cambridge, MA, USA, December 2004. The MIT Press.

Abstract: We generalise the Gaussian process (GP) framework for regression by learning a nonlinear transformation of the GP outputs. This allows for non-Gaussian processes and non-Gaussian noise. The learning algorithm chooses a nonlinear transformation such that transformed data is well-modelled by a GP. This can be seen as including a preprocessing transformation as an integral part of the probabilistic modelling problem, rather than as an ad-hoc step. We demonstrate on several real regression problems that learning the transformation can lead to significantly better performance than using a regular GP, or a GP with a fixed transformation.

A. Dubey, S. Hwang, C. Rangel, Carl Edward Rasmussen, Zoubin Ghahramani, and David L. Wild. Clustering protein sequence and structure space with infinite Gaussian mixture models. In Pacific Symposium on Biocomputing 2004, pages 399-410, Singapore, 2004. World Scientific Publishing.

Abstract: We describe a novel approach to the problem of automatically clustering protein sequences and discovering protein families, subfamilies etc., based on the thoery of infinite Gaussian mixture models. This method allows the data itself to dictate how many mixture components are required to model it, and provides a measure of the probability that two proteins belong to the same cluster. We illustrate our methods with application to three data sets: globin sequences, globin sequences with known tree-dimensional structures and G-pretein coupled receptor sequences. The consistency of the clusters indicate that that our methods is producing biologically meaningful results, which provide a very good indication of the underlying families and subfamilies. With the inclusion of secondary structure and residue solvent accessibility information, we obtain a classification of sequences of known structure which reflects and extends their SCOP classifications.

Ananya Dubey, Seungwoo Hwang, Claudia Rangel, Carl Edward Rasmussen, Zoubin Ghahramani, and David L. Wild. Clustering protein sequence and structure space with infinite Gaussian mixture models. In Russ B. Altman, A. Keith Dunker, Lawrence Hunter, Tiffany A. Jung, and Teri E. Klein, editors, Pacific Symposium on Biocomputing, pages 399-410. World Scientific, 2004.

Abstract: We describe a novel approach to the problem of automatically clustering protein sequences and discovering protein families, subfamilies etc., based on the theory of infinite Gaussian mixtures models. This method allows the data itself to dictate how many mixture components are required to model it, and provides a measure of the probability that two proteins belong to the same cluster. We illustrate our methods with application to three data sets: globin sequences, globin sequences with known three-dimensional structures and G-protein coupled receptor sequences. The consistency of the clusters indicate that our method is producing biologically meaningful results, which provide a very good indication of the underlying families and subfamilies. With the inclusion of secondary structure and residue solvent accessibility information, we obtain a classification of sequences of known structure which both reflects and extends their SCOP classifications. A supplementray web site containing larger versions of the figures is available at http://public.kgi.edu/approximately wid/PSB04/index.html

Jan Eichhorn, Andreas S. Tolias, Alexander Zien, Malte Kuß, Carl Edward Rasmussen, Jason Weston, Nikos K. Logothetis, and Bernhard Schölkopf. Prediction on spike data using kernel algorithms. In Sebastian Thrun, Lawrence K. Saul, and Bernhard Schölkopf, editors, Advances in Neural Information Processing Systems 16, volume 16, pages 1367-1374, Cambridge, MA, USA, 2004. The MIT Press.

Abstract: We report and compare the performance of different learning algorithms based on data from cortical recordings. The task is to predict the orientation of visual stimuli from the activity of a population of simultaneously recorded neurons. We compare several ways of improving the coding of the input (i.e., the spike data) as well as of the output (i.e., the orientation), and report the results obtained using different kernel algorithms.

Matthias O. Franz, Younghee Kwon, Carl Edward Rasmussen, and Bernhard Schölkopf. Semi-supervised kernel regression using whitened function classes. In C. E. Rasmussen, H. H. Bülthoff, M. A. Giese, and B. Schölkopf, editors, Lecture Notes in Computer Science (LNCS), volume 3175, pages 18-26, Berlin, Germany, 2004. Springer.

Abstract: The use of non-orthonormal basis functions in ridge regression leads to an often undesired non-isotropic prior in function space. In this study, we investigate an alternative regularization technique that results in an implicit whitening of the basis functions by penalizing directions in function space with a large prior variance. The regularization term is computed from unlabelled input data that characterizes the input distribution. Tests on two datasets using polynomial basis functions showed an improved average performance compared to standard ridge regression.

Dilan Görür, Carl Edward Rasmussen, Andreas S. Tolias, Fabian Sinz, and Nikos K. Logothetis. Modelling spikes with mixtures of factor analysers. In C. E. Rasmussen, H. H. Bülthoff, B. Schölkopf, and M. A. Giese, editors, DAGM 2004, volume 3175 of Lecture Notes in Computer Science (LNCS), pages 391-398, Berlin, Germany, 09 2004. Springer.

Abstract: Identifying the action potentials of individual neurons from extracellular recordings, known as spike sorting, is a challenging problem. We consider the spike sorting problem using a generative model,mixtures of factor analysers, which concurrently performs clustering and feature extraction. The most important advantage of this method is that it quantifies the certainty with which the spikes are classified. This can be used as a means for evaluating the quality of clustering and therefore spike isolation. Using this method, nearly simultaneously occurring spikes can also be modelled which is a hard task for many of the spike sorting methods. Furthermore, modelling the data with a generative model allows us to generate simulated data.

Juš Kocijan, Roderick Murray-Smith, Carl Edward Rasmussen, and Agathe Girard. Gaussian process model based predictive control. In American Control Conference, pages 2214-2219, 2004.

Abstract: Gaussian process models provide a probabilistic non-parametric modelling approach for black-box identi cation of non-linear dynamic systems. The Gaussian processes can highlight areas of the input space where prediction quality is poor, due to the lack of data or its complexity, by indicating the higher variance around the predicted mean. Gaussian process models contain noticeably less coef cients to be optimised. This paper illustrates possible application of Gaussian process models within model-based predictive control. The extra information provided within Gaussian process model is used in predictive control, where optimisation of control signal takes the variance information into account. The predictive control principle is demonstrated on control of pH process benchmark.

Carl Edward Rasmussen. Gaussian processes in machine learning. In Olivier Bousquet, Ulrike von Luxburg, and Gunnar Rätsch, editors, Advanced Lectures on Machine Learning: ML Summer Schools 2003, Canberra, Australia, February 2 - 14, 2003, Tübingen, Germany, August 4 - 16, 2003, Revised Lectures, volume 3176 of Lecture Notes in Computer Science (LNCS), pages 63-71. Springer-Verlag, Heidelberg, 2004.

Abstract: We give a basic introduction to Gaussian Process regression models. We focus on understanding the role of the stochastic process and how it is used to define a distribution over functions. We present the simple equations for incorporating training data and examine how to learn the hyperparameters using the marginal likelihood. We explain the practical advantages of Gaussian Process and end with conclusions and a look at the current trends in GP work.

Comment: Copyright by Springer, springerlink

Fabian Sinz, Joaquin Quiñonero-Candela, Gökhan H. Bakir, Carl Edward Rasmussen, and Matthias O. Franz. Learning depth from stereo. In C. E. Rasmussen, H. H. Bülthoff, B. Schölkopf, and M. A. Giese, editors, 26th DAGM Symposium, volume 3175 of Lecture Notes in Computer Science (LNCS), pages 245-252, Berlin, Germany, 09 2004. Springer.

Abstract: We compare two approaches to the problem of estimating the depth of a point in space from observing its image position in two different cameras: 1. The classical photogrammetric approach explicitly models the two cameras and estimates their intrinsic and extrinsic parameters using a tedious calibration procedure; 2. A generic machine learning approach where the mapping from image to spatial coordinates is directly approximated by a Gaussian Process regression. Our results show that the generic learning approach, in addition to simplifying the procedure of calibration, can lead to higher depth accuracies than classical calibration although no specific domain knowledge is used.

Agathe Girard, Carl Edward Rasmussen, Joaquin Quiñonero-Candela, and Roderick Murray-Smith. Gaussian process priors with uncertain inputs - application to multiple-step ahead time series forecasting. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 529-536, Cambridge, MA, USA, December 2003. The MIT Press.

Abstract: We consider the problem of multi-step ahead prediction in time series analysis using the non-parametric Gaussian process model. k-step ahead forecasting of a discrete-time non-linear dynamic system can be performed by doing repeated one-step ahead predictions. For a state-space model of the form yt = f(yt-1,...,yt-L), the prediction of y at time t + k is based on the point estimates of the previous outputs. In this paper, we show how, using an analytical Gaussian approximation, we can formally incorporate the uncertainty about intermediate regressor values, thus updating the uncertainty on the current prediction.

Carl Edward Rasmussen and Zoubin Ghahramani. Bayesian Monte Carlo. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 489-496, Cambridge, MA, USA, December 2003. The MIT Press.

Abstract: We investigate Bayesian alternatives to classical Monte Carlo methods for evaluating integrals. Bayesian Monte Carlo (BMC) allows the incorporation of prior knowledge, such as smoothness of the integrand, into the estimation. In a simple problem we show that this outperforms any classical importance sampling method. We also attempt more challenging multidimensional integrals involved in computing marginal likelihoods of statistical models (a.k.a. partition functions and model evidences). We find that Bayesian Monte Carlo outperformed Annealed Importance Sampling, although for very high dimensional problems or problems with massive multimodality BMC may be less adequate. One advantage of the Bayesian approach to Monte Carlo is that samples can be drawn from any distribution. This allows for the possibility of active design of sample points so as to maximise information gain.

Ercan Solak, Roderick Murray-Smith, William E. Leithead, Douglas Leith, and Carl Edward Rasmussen. Derivative observations in Gaussian process models of dynamic systems. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 1033-1040, Cambridge, MA, USA, December 2003. The MIT Press.

Abstract: Gaussian processes provide an approach to nonparametric modelling which allows a straightforward combination of function and derivative observations in an empirical model. This is of particular importance in identification of nonlinear dynamic systems from experimental data. 1) It allows us to combine derivative information, and associated uncertainty with normal function observations into the learning and inference process. This derivative information can be in the form of priors specified by an expert or identified from perturbation data close to equilibrium. 2) It allows a seamless fusion of multiple local linear models in a consistent manner, inferring consistent models and ensuring that integrability constraints are met. 3) It improves dramatically the computational efficiency of Gaussian process models for dynamic system identification, by summarising large quantities of near-equilibrium data by a handful of linearisations, reducing the training set size - traditionally a problem for Gaussian process models.

Roderick Murray-Smith, Daniel Sbarbaro, Carl Edward Rasmussen, and Agathe Girard. Adaptive, cautious, predictive control with Gaussian process priors. In P. Van den Hof, B. Wahlberg, and S. Weiland, editors, IFAC SYSID 2003, pages 1195-1200, Oxford, UK, August 2003. Elsevier Science Ltd.

Abstract: Nonparametric Gaussian Process models, a Bayesian statistics approach, are used to implement a nonlinear adaptive control law. Predictions, including propagation of the state uncertainty are made over a k-step horizon. The expected value of a quadratic cost function is minimised, over this prediction horizon, without ignoring the variance of the model predictions. The general method and its main features are illustrated on a simulation example.

Joaquin Quiñonero-Candela, Agathe Girard, Jan Larsen, and Carl Edward Rasmussen. Propagation of uncertainty in Bayesian kernel models - application to multiple-step ahead forecasting. In ICASSP 2003, volume 2, pages 701-704, April 2003.

Abstract: The object of Bayesian modelling is the predictive distribution, which in a forecasting scenario enables improved estimates of forecasted values and their uncertainties. In this paper we focus on reliably estimating the predictive mean and variance of forecasted values using Bayesian kernel based models such as the Gaussian Process and the Relevance Vector Machine. We derive novel analytic expressions for the predictive mean and variance for Gaussian kernel shapes under the assumption of a Gaussian input distribution in the static case, and of a recursive Gaussian predictive density in iterative forecasting. The capability of the method is demonstrated for forecasting of time-series and compared to approximate methods.

Juš Kocijan, Blaž Banko, Bojan Likar, Agathe Girard, Roderick Murray-Smith, and Carl Edward Rasmussen. A case based comparison of identification with neural network and Gaussian process models. In IFAC Internaltional Conference on Intelligent Control Systems and Signal Processing, volume 1, pages 137-142, 2003.

Abstract: In this paper an alternative approach to black-box identification of non-linear dynamic systems is compared with the more established approach of using artificial neural networks. The Gaussian process prior approach is a representative of non-parametric modelling approaches. It was compared on a pH process modelling case study. The purpose of modelling was to use the model for control design. The comparison revealed that even though Gaussian process models can be effectively used for modelling dynamic systems caution has to be axercised when signals are selected.

Juš Kocijan, Roderick Murray-Smith, Carl Edward Rasmussen, and Bojan Likar. Predictive control with Gaussian process models. In B. Zajc and M. Tkal, editors, IEEE Region 8 Eurocon 2003: Computer as a Tool, pages 352-356, 2003.

Abstract: This paper describes model-based predictive control based on Gaussian processes. Gaussian process models provide a probabilistic non-parametric modelling approach for black-box identification of non-linear dynamic systems. It offers more insight in variance of obtained model response, as well as fewer parameters to determine than other models. The Gaussian processes can highlight areas of the input space where prediction quality is poor, due to the lack of data or its complexity, by indicating the higher variance around the predicted mean. This property is used in predictive control, where optimisation of control signal takes the variance information into account. The predictive control principle is demonstrated on a simulated example of nonlinear system.

Joaquin Quiñonero-Candela, Agathe Girard, Jan Larsen, and Carl Edward Rasmussen. Propagation of uncertainty in Bayesian kernel models - application to multiple-step ahead forecasting. In C. Molina, T. Adali, J. Larsen, M. Van Hulle, S. C. Douglas, and J. Rouat, editors, NNSP 2003, Piscataway, New Jersey, 2003. IEEE Press.

Abstract: The object of Bayesian modelling is the predictive distribution, which in a forecasting scenario enables improved estimates of forecasted values and their uncertainties. In this paper we focus on reliably estimating the predictive mean and variance of forecasted values using Bayesian kernel based models such as the Gaussian Process and the Relevance Vector Machine. We derive novel analytic expressions for the predictive mean and variance for Gaussian kernel shapes under the assumption of a Gaussian input distribution in the static case, and of a recursive Gaussian predictive density in iterative forecasting. The capability of the method is demonstrated for forecasting of time-series and compared to approximate methods.

Comment: Electronic version of Quiñonero-Candela, Girard, Larsen and Rasmussen, 2003 which should have been presented at ICASSP 03, but was cancelled due to bird flu epidemic.

Joaquin Quiñonero-Candela, Agathe Girard, and Carl Edward Rasmussen. Prediction at an uncertain input for Gaussian processes and Relevance Vector Machines application to multiple-step ahead time-series prediction. Technical Report IMM-2003-18, Instititute for Mathemetical Modelling, DTU, 2003.

Comment: techreport

Carl Edward Rasmussen. Gaussian processes to speed up Hybrid Monte Carlo for expensive Bayesian integrals. In Bayesian Statistics 7, pages 651-659. Oxford University Press, 2003.

Abstract: Hybrid Monte Carlo (HMC) is often the method of choice for computing Bayesian integrals that are not analytically tractable. However the success of this method may require a very large number of evaluations of the (un-normalized) posterior and its partial derivatives. In situations where the posterior is computationally costly to evaluate, this may lead to an unacceptable computational load for HMC. I propose to use a Gaussian Process model of the (log of the) posterior for most of the computations required by HMC. Within this scheme only occasional evaluation of the actual posterior is required to guarantee that the samples generated have exactly the desired distribution, even if the GP model is somewhat inaccurate. The method is demonstrated on a 10 dimensional problem, where 200 evaluations suffice for the generation of 100 roughly independent points from the posterior. Thus, the proposed scheme allows Bayesian treatment of models with posteriors that are computationally demanding, such as models involving computer simulation.

Matthew J. Beal, Zoubin Ghahramani, and Carl Edward Rasmussen. The infinite hidden Markov model. In Advances in Neural Information Processing Systems 14, pages 577-584, Cambridge, MA, USA, December 2002. The MIT Press.

Abstract: We show that it is possible to extend hidden Markov models to have a countably infinite number of hidden states. By using the theory of Dirichlet processes we can implicitly integrate out the infinitely many transition parameters, leaving only three hyperparameters which can be learned from data. These three hyperparameters define a hierarchical Dirichlet process capable of capturing a rich set of transition dynamics. The three hyperparameters control the time scale of the dynamics, the sparsity of the underlying state-transition matrix, and the expected number of distinct hidden states in a finite sequence. In this framework it is also natural to allow the alphabet of emitted symbols to be infinite - consider, for example, symbols being possible words appearing in English text.

Carl Edward Rasmussen and Zoubin Ghahramani. Infinite mixtures of Gaussian process experts. In Advances in Neural Information Processing Systems 14, pages 881-888, Cambridge, MA, USA, December 2002. The MIT Press.

Abstract: We present an extension to the Mixture of Experts (ME) model, where the individual experts are Gaussian Process (GP) regression models. Using an input-dependent adaptation of the Dirichlet Process, we implement a gating network for an infinite number of Experts. Inference in this model may be done efficiently using a Markov Chain relying on Gibbs sampling. The model allows the effective covariance function to vary with the inputs, and may handle large datasets - thus potentially overcoming two of the biggest hurdles with GP models. Simulations show the viability of this approach.

Irene K. Andersen, Anna Szymkowiak, Carl Edward Rasmussen, L. G. Hanson, J. R. Marstrand, H. B. W. Larsson, and Lars Kai Hansen. Perfusion quantification using Gaussian process deconvolution. Magnetic Resonance in Medicine, 48(2):351-361, 2002, doi 10.1002/mrm.10213.

Abstract: The quantification of perfusion using dynamic susceptibility contrast MR imaging requires deconvolution to obtain the residual impulse-response function (IRF). Here, a method using a Gaussian process for deconvolution, GPD, is proposed. The fact that the IRF is smooth is incorporated as a constraint in the method. The GPD method, which automatically estimates the noise level in each voxel, has the advantage that model parameters are optimized automatically. The GPD is compared to singular value decomposition (SVD) using a common threshold for the singular values and to SVD using a threshold optimized according to the noise level in each voxel. The comparison is carried out using artificial data as well as using data from healthy volunteers. It is shown that GPD is comparable to SVD variable optimized threshold when determining the maximum of the IRF, which is directly related to the perfusion. GPD provides a better estimate of the entire IRF. As the signal to noise ratio increases or the time resolution of the measurements increases, GPD is shown to be superior to SVD. This is also found for large distribution volumes.

Christopher K. I. Williams, Carl Edward Rasmussen, Anton Schwaighofer, and Volker Tresp. Observations on the Nyström method for Gaussian process prediction. Technical report, University of Edinburgh, 2002.

Abstract: A number of methods for speeding up Gaussian Process (GP) prediction have been proposed, including the Nyström method of Williams and Seeger (2001). In this paper we focus on two issues (1) the relationship of the Nyström method to the Subset of Regressors method (Poggio and Girosi 1990; Luo and Wahba, 1997) and (2) understanding in what circumstances the Nyström approximation would be expected to provide a good approximation to exact GP regression.

Carl Edward Rasmussen and Zoubin Ghahramani. Occam's razor. In Advances in Neural Information Processing Systems 13, pages 294-300, Cambridge, MA, USA, December 2001. The MIT Press.

Abstract: The Bayesian paradigm apparently only sometimes gives rise to Occam's Razor; at other times very large models perform well. We give simple examples of both kinds of behaviour. The two views are reconciled when measuring complexity of functions, rather than of the machinery used to implement them. We analyze the complexity of functions for some linear in the parameter models that are equivalent to Gaussian Processes, and always find Occam's Razor at work.

Pedro A. d. F. R. Højen-Sørensen, Carl Edward Rasmussen, and Lars Kai Hansen. Bayesian modelling of fMRI time series. In Advances in Neural Information Processing Systems 12, pages 754-760. The MIT Press, 2000.

Abstract: We present a Hidden Markov Model (HMM) for inferring the hidden psychological state (or neural activity) during single trial fMRI activation experiments with blocked task paradigms. Inference is based on Bayesian methodology, using a combination of analytical and a variety of Markov Chain Monte Carlo (MCMC) sampling techniques. The advantage of this method is that detection of short time learning effects between repeated trials is possible since inference is based only on single trial experiments.

Carl Edward Rasmussen. The infinite Gaussian mixture model. In Advances in Neural Information Processing Systems 12, pages 554-560. The MIT Press, 2000.

Abstract: In a Bayesian mixture model it is not necessary a priori to limit the number of components to be finite. In this paper an infinite Gaussian mixture model is presented which neatly sidesteps the difficult problem of finding the "right" number of mixture components. Inference in the model is done using an efficient parameter-free Markov Chain that relies entirely on Gibbs sampling.

Carl Edward Rasmussen and Zoubin Ghahramani. Occam's razor. In Michael J. Kearns, Sara A. Solla, and David A. Cohn, editors, NIPS, pages 294-300. The MIT Press, 2000.

Abstract: The Bayesian paradigm apparently only sometimes gives rise to Occam's Razor; at other times very large models perform well. We give simple examples of both kinds of behaviour. The two views are reconciled when measuring complexity of functions, rather than of the machinery used to implement them. We analyze the complexity of functions for some linear in the parameter models that are equivalent to Gaussian Processes, and always find Occam's Razor at work.

Carl Edward Rasmussen. Evaluation of Gaussian processes and other methods for non-linear regression. PhD thesis, University of Toronto, Department of Computer Science, Toronto, CANADA, 1996.

Abstract: This thesis develops two Bayesian learning methods relying on Gaussian processes and a rigorous statistical approach for evaluating such methods. In these experimental designs the sources of uncertainty in the estimated generalisation performances due to both variation in training and test sets are accounted for. The framework allows for estimation of generalisation performance as well as statistical tests of significance for pairwise comparisons. Two experimental designs are recommended and supported by the DELVE software environment.
Two new non-parametric Bayesian learning methods relying on Gaussian process priors over functions are developed. These priors are controlled by hyperparameters which set the characteristic length scale for each input dimension. In the simplest method, these parameters are fit from the data using optimization. In the second, fully Bayesian method, a Markov chain Monte Carlo technique is used to integrate over the hyperparameters. One advantage of these Gaussian process methods is that the priors and hyperparameters of the trained models are easy to interpret.
The Gaussian process methods are benchmarked against several other methods, on regression tasks using both real data and data generated from realistic simulations. The experiments show that small datasets are unsuitable for benchmarking purposes because the uncertainties in performance measurements are large. A second set of experiments provide strong evidence that the bagging procedure is advantageous for the Multivariate Adaptive Regression Splines (MARS) method.
The simulated datasets have controlled characteristics which make them useful for understanding the relationship between properties of the dataset and the performance of different methods. The dependency of the performance on available computation time is also investigated. It is shown that a Bayesian approach to learning in multi-layer perceptron neural networks achieves better performance than the commonly used early stopping procedure, even for reasonably short amounts of computation time. The Gaussian process methods are shown to consistently outperform the more conventional methods.

Carl Edward Rasmussen. A practical Monte Carlo implementation of Bayesian learning. In Advances in Neural Information Processing Systems 8, pages 598-604, Cambridge, MA., USA, 1996. The MIT Press.

Abstract: A practical method for Bayesian training of feed-forward neural networks using sophisticated Monte Carlo methods is presented and evaluated. In reasonably small amounts of computer time this approach outperforms other state-of-the-art methods on 5 datalimited tasks from real world domains.

Carl Edward Rasmussen, Radford M. Neal, Geoffrey E. Hinton, Drew van Camp, Mike Revow, Zoubin Ghahramani, Rafal Kustra, and Robert Tibshirani. The DELVE manual, 1996.

Abstract: DELVE - Data for Evaluating Learning in Valid Experiments - is a collection of datasets from many sources, an environment within which this data can be used to assess the performance of methods for learning relationships from data, and a repository for the results of such experiments.

Comment: The delve website.

Chris K. I. Williams and Carl Edward Rasmussen. Gaussian processes for regression. In Advances in Neural Information Processing Systems 8, pages 514-520, Cambridge, MA., USA, 1996. The MIT Press.

Abstract: The Bayesian analysis of neural networks is difficult because a simple prior over weights implies a complex prior over functions. We investigate the use of a Gaussian process prior over functions, which permits the predictive Bayesian analysis for fixed values of hyperparameters to be carried out exactly using matrix operations. Two methods, using optimization and averaging (via Hybrid Monte Carlo) over hyperparameters have been tested on a number of challenging problems and have produced excellent results.

Lars Kai Hansen and Carl Edward Rasmussen. Pruning from adaptive regularization. Neural Computation, 6(6):1222-1231, 1994.

Abstract: Inspired by the recent upsurge of interest in Bayesian methods we consider adaptive regularization. A generalization based scheme for adaptation of regularization parameters is introduced and compared to Bayesian regularization. We show that pruning arises naturally within both adaptive regularization schemes. As model example we have chosen the simplest possible: estimating the mean of a random variable with known variance. Marked similarities are found between the two methods in that they both involve a "noise limit", below which they regularize with infinite weight decay, i.e., they prune. However, pruning is not always beneficial. We show explicitly that both methods in some cases may increase the generalization error. This corresponds to situations where the underlying assumptions of the regularizer are poorly matched to the environment.

Carl Edward Rasmussen and David J. Willshaw. Presynaptic and postsynaptic comptetition in models for the development of neuromuscular connections. Biological Cybernetics, 68(5):409-419, 1993, doi 10.1007/BF00198773.

Abstract: In the establishment of connections between nerve and muscle there is an initial stage when each muscle fibre is innervated by several different motor axons. Withdrawal of connections then takes place until each fibre has contact from just a single axon. The evidence suggests that the withdrawal process involves competition between nerve terminals. We examine in formal models several types of competitive mechanism that have been proposed for this phenomenon. We show that a model which combines competition for a presynaptic resource with competition for a postsynaptic resource is superior to others. This model accounts for many anatomical and physiological findings and has a biologically plausible implementation. Intrinsic withdrawal appears to be a side effect of the competitive mechanism rather than a separate non-competitive feature. The model's capabilities are confirmed by theoretical analysis and full scale computer simulations.


Dan Roy

Creighton Heaukulani and Daniel M. Roy. The combinatorial structure of beta negative binomial processes. Technical report, Dept. of Engineering, University of Cambridge, March 2014.

Abstract: We characterize the combinatorial structure of conditionally-i.i.d. sequences of negative binomial processes with a common beta process base measure. In Bayesian nonparametric applications, such processes have served as models for unknown multisets of a measurable space. Previous work has characterized random subsets arising from conditionally-i.i.d. sequences of Bernoulli processes with a common beta process base measure. In this case, the combinatorial structure is described by the Indian buffet process. Our results give a count analogue of the Indian buffet process, which we call a negative binomial Indian buffet process. As an intermediate step toward this goal, we provide constructions for the beta negative binomial process that avoid a representation of the underlying beta process base measure.

James Robert Lloyd, Peter Orbanz, Zoubin Ghahramani, and Daniel M. Roy. Random function priors for exchangeable arrays with applications to graphs and relational data. In Advances in Neural Information Processing Systems 26, pages 1-9, Lake Tahoe, California, USA, December 2012.

Abstract: A fundamental problem in the analysis of structured relational data like graphs, networks, databases, and matrices is to extract a summary of the common structure underlying relations between individual entities. Relational data are typically encoded in the form of arrays; invariance to the ordering of rows and columns corresponds to exchangeable arrays. Results in probability theory due to Aldous, Hoover and Kallenberg show that exchangeable arrays can be represented in terms of a random measurable function which constitutes the natural model parameter in a Bayesian model. We obtain a flexible yet simple Bayesian nonparametric model by placing a Gaussian process prior on the parameter function. Efficient inference utilises elliptical slice sampling combined with a random sparse approximation to the Gaussian process. We demonstrate applications of the model to network data and clarify its relation to models in the literature, several of which emerge as special cases.

Daniel M. Roy. On the computability and complexity of Bayesian reasoning. In NIPS Workshop on Philosophy and Machine Learning, 2011.

Abstract: If we consider the claim made by some cognitive scientists that the mind performs Bayesian reasoning, and if we simultaneously accept the Physical Church-Turing thesis and thus believe that the computational power of the mind is no more than that of a Turing machine, then what limitations are there to the reasoning abilities of the mind? I give an overview of joint work with Nathanael Ackerman (Harvard, Mathematics) and Cameron Freer (MIT, CSAIL) that bears on the computability and complexity of Bayesian reasoning. In particular, we prove that conditional probability is in general not computable in the presence of continuous random variables. However, in light of additional structure in the prior distribution, such as the presence of certain types of noise, or of exchangeability, conditioning is possible. These results cover most of statistical practice. At the workshop on Logic and Computational Complexity, we presented results on the computational complexity of conditioning, embedding sharp-P-complete problems in the task of computing conditional probabilities for diffuse continuous random variables. This work complements older work. For example, under cryptographic assumptions, the computational complexity of producing samples and computing probabilities was separated by Ben-David, Chor, Goldreich and Luby. In recent work, we also make use of cryptographic assumptions to show that different representations of exchangeable sequences may have vastly different complexity. However, when faced with an adversary that is computational bounded, these different representations have the same complexity, highlighting the fact that knowledge representation and approximation play a fundamental role in the possibility and plausibility of Bayesian reasoning.

David Sontag and Daniel M. Roy. The Complexity of Inference in Latent Dirichlet Allocation. In Advances in Neural Information Processing Systems 24, Cambridge, MA, USA, 2011. The MIT Press.

Abstract: We consider the computational complexity of probabilistic inference in Latent Dirichlet Allocation (LDA). First, we study the problem of finding the maximum a posteriori (MAP) assignment of topics to words, where the document's topic distribution is integrated out. We show that, when the effective number of topics per document is small, exact inference takes polynomial time. In contrast, we show that, when a document has a large number of topics, finding the MAP assignment of topics to words in LDA is NP-hard. Next, we consider the problem of finding the MAP topic distribution for a document, where the topic-word assignments are integrated out. We show that this problem is also NP-hard. Finally, we briefly discuss the problem of sampling from the posterior, showing that this is NP-hard in one restricted setting, but leaving open the general question.


Yunus Saatçi

E. Gilboa, Yunus Saatçi, and John P. Cunningham. Scaling multidimensional Gaussian processes using projected additive approximations. In 30th International Conference on Machine Learning, 2013.

Abstract: Exact Gaussian Process (GP) regression has O(N3) runtime for data size N, making it intractable for large N. Many algorithms for improving GP scaling approximate the covariance with lower rank matrices. Other work has exploited structure inherent in particular covariance functions, including GPs with implied Markov structure, and equispaced inputs (both enable O(N) runtime). However, these GP advances have not been extended to the multidimensional input setting, despite the preponderance of multidimensional applications. This paper introduces and tests novel extensions of structured GPs to multidimensional inputs. We present new methods for additive GPs, showing a novel connection between the classic backfitting method and the Bayesian framework. To achieve optimal accuracy-complexity tradeoff, we extend this model with a novel variant of projection pursuit regression. Our primary result – projection pursuit Gaussian Process Regression – shows orders of magnitude speedup while preserving high accuracy. The natural second and third steps include non-Gaussian observations and higher dimensional equispaced grid methods. We introduce novel techniques to address both of these necessary directions. We thoroughly illustrate the power of these three advances on several datasets, achieving close performance to the naive Full GP at orders of magnitude less cost.

E. Gilboa, Yunus Saatçi, and John P. Cunningham. Scaling multidimensional inference for structured Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013.

Comment: to appear

Yunus Saatçi. Scalable inference for structured Gaussian process models. PhD thesis, University of Cambridge, Department of Engineering, Cambridge, UK, 2011.

Abstract: The generic inference and learning algorithm for Gaussian Process (GP) regression has O(N3) runtime and O(N2) memory complexity, where N is the number of observations in the dataset. Given the computational resources available to a present-day workstation, this implies that GP regression simply cannot be run on large datasets. The need to use non- Gaussian likelihood functions for tasks such as classification adds even more to the computational burden involved.
The majority of algorithms designed to improve the scaling of GPs are founded on the idea of approximating the true covariance matrix, which is usually of rank N, with a matrix of rank P, where P<<N. Typically, the true training set is replaced with a smaller, representative (pseudo-) training set such that a specific measure of information loss is minimized. These algorithms typically attain O(P2N) runtime and O(PN) space complexity. They are also general in the sense that they are designed to work with any covariance function. In essence, they trade off accuracy with computational complexity. The central contribution of this thesis is to improve scaling instead by exploiting any structure that is present in the covariance matrices generated by particular covariance functions. Instead of settling for a kernel-independent accuracy/complexity trade off, as is done in much the literature, we often obtain accuracies close to, or exactly equal to the full GP model at a fraction of the computational cost.
We define a structured GP as any GP model that is endowed with a kernel which produces structured covariance matrices. A trivial example of a structured GP is one with the linear regression kernel. In this case, given inputs living in RD, the covariance matrices generated have rank D - this results in significant computational gains in the usual case where D<<N. Another case arises when a stationary kernel is evaluated on equispaced, scalar inputs. This results in Toeplitz covariance matrices and all necessary computations can be carried out exactly in O(N log N).
This thesis studies four more types of structured GP. First, we comprehensively review the case of kernels corresponding to Gauss-Markov processes evaluated on scalar inputs. Using state-space models we show how (generalised) regression (including hyperparameter learning) can be performed in O(N log N) runtime and O(N) space. Secondly, we study the case where we introduce block structure into the covariance matrix of a GP time-series model by assuming a particular form of nonstationarity a priori. Third, we extend the efficiency of scalar Gauss-Markov processes to higher-dimensional input spaces by assuming additivity. We illustrate the connections between the classical backfitting algorithm and approximate Bayesian inference techniques including Gibbs sampling and variational Bayes. We also show that it is possible to relax the rather strong assumption of additivity without sacrificing O(N log N) complexity, by means of a projection-pursuit style GP regression model. Finally, we study the properties of a GP model with a tensor product kernel evaluated on a multivariate grid of inputs locations. We show that for an arbitrary (regular or irregular) grid the resulting covariance matrices are Kronecker and full GP regression can be implemented in O(N) time and memory usage.
We illustrate the power of these methods on several real-world regression datasets which satisfy the assumptions inherent in the structured GP employed. In many cases we obtain performance comparable to the generic GP algorithm. We also analyse the performance degradation when these assumptions are not met, and in several cases show that it is comparable to that observed for sparse GP methods. We provide similar results for regression tasks with non-Gaussian likelihoods, an extension rarely addressed by sparse GP techniques.

Yunus Saatçi, Ryan Turner, and Carl Edward Rasmussen. Gaussian process change point models. In 27th International Conference on Machine Learning, pages 927-934, Haifa, Israel, June 2010.

Abstract: We combine Bayesian online change point detection with Gaussian processes to create a nonparametric time series model which can handle change points. The model can be used to locate change points in an online manner; and, unlike other Bayesian online change point detection algorithms, is applicable when temporal correlations in a regime are expected. We show three variations on how to apply Gaussian processes in the change point context, each with their own advantages. We present methods to reduce the computational burden of these models and demonstrate it on several real world data sets.

Comment: poster, slides.

Ryan Turner, Yunus Saatçi, and Carl Edward Rasmussen. Adaptive sequential Bayesian change point detection. In Zaïd Harchaoui, editor, NIPS Workshop on Temporal Segmentation, Whistler, BC, Canada, December 2009.

Abstract: Real-world time series are often nonstationary with respect to the parameters of some underlying prediction model (UPM). Furthermore, it is often desirable to adapt the UPM to incoming regime changes as soon as possible, necessitating sequential inference about change point locations. A Bayesian algorithm for online change point detection (BOCPD) has been introduced recently by Adams and MacKay (2007). In this algorithm, uncertainty about the last change point location is updated sequentially, and is integrated out to make online predictions robust to parameter changes. BOCPD requires a set of fixed hyper-parameters which allow the user to fully specify the hazard function for change points and the prior distribution over the parameters of the UPM. In practice, finding the "right" hyper-parameters can be quite difficult. We therefore extend BOCPD by introducing hyper-parameter learning, without sacrificing the online nature of the algorithm. Hyper-parameter learning is performed by optimizing the marginal likelihood of the BOCPD model, a closed-form quantity which can be computed sequentially. We illustrate performance on three real-world datasets.

Comment: poster, slides.

Richard J. Gibbens and Yunus Saatçi. Data, modelling and inference in road traffic networks. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 366(1872):1907-1919, June 2008, doi 10.1098/rsta.2008.0020.

Abstract: In this paper, we study UK road traffic data and explore a range of modelling and inference questions that arise from them. For example, loop detectors on the M25 motorway record speed and flow measurements at regularly spaced locations as well as the entry and exit lanes of junctions. An exploratory study of these data helps us to better understand and quantify the nature of congestion on the road network. From a traveller's perspective it is crucially important to understand the overall journey times and we look at methods to improve our ability to predict journey times given access jointly to both real-time and historical loop detector data. Throughout this paper we will comment on related work derived from US freeway data.

Jurgen Van Gael, Yunus Saatçi, Yee-Whye Teh, and Zoubin Ghahramani. Beam sampling for the infinite hidden Markov model. In 25th International Conference on Machine Learning, volume 25, pages 1088-1095, Helsinki, Finland, 2008. ACM.

Abstract: The infinite hidden Markov model is a non-parametric extension of the widely used hidden Markov model. Our paper introduces a new inference algorithm for the infinite Hidden Markov model called beam sampling. Beam sampling combines slice sampling, which limits the number of states considered at each time step to a finite number, with dynamic programming, which samples whole state trajectories efficiently. Our algorithm typically outperforms the Gibbs sampler and is more robust. We present applications of iHMM inference using the beam sampler on changepoint detection and text prediction problems.


Christian Steinrücken

Christian Steinruecken. Compressing sets and multisets of sequences. In Ali Bilgin, Michael W. Marcellin, Joan Serra-Sagrista, and James A. Storer, editors, Proceedings of the Data Compression Conference (DCC). IEEE Computer Society, 2014, doi 10.1109/DCC.2014.89.

Abstract: This article describes lossless compression algorithms for multisets of sequences, taking advantage of the multiset's unordered structure. Multisets are a generalisation of sets where members are allowed to occur multiple times. A multiset can be encoded naively by simply storing its elements in some sequential order, but then information is wasted on the ordering. We propose a technique that transforms the multiset into an order-invariant tree representation, and derive an arithmetic code that optimally compresses the tree. Our method achieves compression even if the sequences in the multiset are individually incompressible (such as cryptographic hash sums). The algorithm is demonstrated practically by compressing collections of SHA-1 hash sums, and multisets of arbitrary, individually encodable objects.


Rich Turner

Victoria Leong, Michael A Stone, Richard E Turner, and Usha Goswami. A role for amplitude modulation phase relationships in speech rhythm perception. Journal of the Acoustical Society of America, 136:366-381, 2014.

Abstract: Prosodic rhythm in speech [the alternation of "Strong" (S) and "weak" (w) syllables] is cued, among others, by slow rates of amplitude modulation (AM) within the speech envelope. However, it is unclear exactly which envelope modulation rates and statistics are the most important for the rhythm percept. Here, the hypothesis that the phase relationship between "Stress" rate ( 2 Hz) and "Syllable" rate ( 4 Hz) AMs provides a perceptual cue for speech rhythm is tested. In a rhythm judgment task, adult listeners identified AM tone-vocoded nursery rhyme sentences that carried either trochaic (S-w) or iambic patterning (w-S). Manipulation of listeners' rhythm perception was attempted by parametrically phase-shifting the Stress AM and Syllable AM in the vocoder. It was expected that a 1π radian phase-shift (half a cycle) would reverse the perceived rhythm pattern (i.e., trochaic -> iambic) whereas a 2π radian shift (full cycle) would retain the perceived rhythm pattern (i.e., trochaic -> trochaic). The results confirmed these predictions. Listeners judgments of rhythm systematically followed Stress-Syllable AM phase-shifts, but were unaffected by phase-shifts between the Syllable AM and the Sub-beat AM ( 14 Hz) in a control condition. It is concluded that the Stress-Syllable AM phase relationship is an envelope-based modulation statistic that supports speech rhythm perception.

Richard E. Turner and Maneesh Sahani. Decomposing signals into a sum of amplitude and frequency modulated sinusoids using probabilistic inference. In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, pages 2173-2176, march 2012, doi 10.1109/ICASSP.2012.6288343.

Abstract: There are many methods for decomposing signals into a sum of amplitude and frequency modulated sinusoids. In this paper we take a new estimation based approach. Identifying the problem as ill-posed, we show how to regularize the solution by imposing soft constraints on the amplitude and phase variables of the sinusoids. Estimation proceeds using a version of Kalman smoothing. We evaluate the method on synthetic and natural, clean and noisy signals, showing that it outperforms previous decompositions, but at a higher computational cost.

Richard E. Turner and Maneesh Sahani. Demodulation as probabilistic inference. Transactions on Audio, Speech and Language Processing, 19:2398-2411, 2011.

Abstract: Demodulation is an ill-posed problem whenever both carrier and envelope signals are broadband and unknown. Here, we approach this problem using the methods of probabilistic inference. The new approach, called Probabilistic Amplitude Demodulation (PAD), is computationally challenging but improves on existing methods in a number of ways. By contrast to previous approaches to demodulation, it satisfies five key desiderata: PAD has soft constraints because it is probabilistic; PAD is able to automatically adjust to the signal because it learns parameters; PAD is user-steerable because the solution can be shaped by user-specific prior information; PAD is robust to broad-band noise because this is modelled explicitly; and PAD’s solution is self-consistent, empirically satisfying a Carrier Identity property. Furthermore, the probabilistic view naturally encompasses noise and uncertainty, allowing PAD to cope with missing data and return error bars on carrier and envelope estimates. Finally, we show that when PAD is applied to a bandpass-filtered signal, the stop-band energy of the inferred carrier is minimal, making PAD well-suited to sub-band demodulation.

Richard E. Turner and Maneesh Sahani. Probabilistic amplitude and frequency demodulation. In Advances in Neural Information Processing Systems 24, pages 981-989. The MIT Press, 2011.

Abstract: A number of recent scientific and engineering problems require signals to be decomposed into a product of a slowly varying positive envelope and a quickly varying carrier whose instantaneous frequency also varies slowly over time. Although signal processing provides algorithms for so-called amplitude- and frequency-demodulation (AFD), there are well known problems with all of the existing methods. Motivated by the fact that AFD is ill-posed, we approach the problem using probabilistic inference. The new approach, called probabilistic amplitude and frequency demodulation (PAFD), models instantaneous frequency using an auto-regressive generalization of the von Mises distribution, and the envelopes using Gaussian auto-regressive dynamics with a positivity constraint. A novel form of expectation propagation is used for inference. We demonstrate that although PAFD is computationally demanding, it outperforms previous approaches on synthetic and real signals in clean, noisy and missing data settings.

Richard E. Turner and Maneesh Sahani. Two problems with variational expectation maximisation for time-series models. In D. Barber, T. Cemgil, and S. Chiappa, editors, Bayesian Time series models, chapter 5, pages 109-130. Cambridge University Press, 2011.

Abstract: Variational methods are a key component of the approximate inference and learning toolbox. These methods fill an important middle ground, retaining distributional information about uncertainty in latent variables, unlike maximum a posteriori methods (MAP), and yet generally requiring less computational time than Monte Carlo Markov Chain methods. In particular the variational Expectation Maximisation (vEM) and variational Bayes algorithms, both involving variational optimisation of a free-energy, are widely used in time-series modelling. Here, we investigate the success of vEM in simple probabilistic time-series models. First we consider the inference step of vEM, and show that a consequence of the well-known compactness property of variational inference is a failure to propagate uncertainty in time, thus limiting the usefulness of the retained distributional information. In particular, the uncertainty may appear to be smallest precisely when the approximation is poorest. Second, we consider parameter learning and analytically reveal systematic biases in the parameters found by vEM. Surprisingly, simpler variational approximations (such a mean-field) can lead to less bias than more complicated structured approximations.

Richard E. Turner and Maneesh Sahani. Statistical inference for single- and multi-band probabilistic amplitude demodulation.. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 5466-5469, 2010.

Abstract: Amplitude demodulation is an ill-posed problem and so it is natural to treat it from a Bayesian viewpoint, inferring the most likely carrier and envelope under probabilistic constraints. One such treatment is Probabilistic Amplitude Demodulation (PAD), which, whilst computationally more intensive than traditional approaches, offers several advantages. Here we provide methods for estimating the uncertainty in the PAD-derived envelopes and carriers, and for learning free-parameters like the time-scale of the envelope. We show how the probabilistic approach can naturally handle noisy and missing data. Finally, we indicate how to extend the model to signals which contain multiple modulators and carriers.

Richard E. Turner. Statistical Models for Natural Sounds. PhD thesis, Gatsby Computational Neuroscience Unit, UCL, 2010.

Abstract: It is important to understand the rich structure of natural sounds in order to solve important tasks, like automatic speech recognition, and to understand auditory processing in the brain. This thesis takes a step in this direction by characterising the statistics of simple natural sounds. We focus on the statistics because perception often appears to depend on them, rather than on the raw waveform. For example the perception of auditory textures, like running water, wind, fire and rain, depends on summary-statistics, like the rate of falling rain droplets, rather than on the exact details of the physical source. In order to analyse the statistics of sounds accurately it is necessary to improve a number of traditional signal processing methods, including those for amplitude demodulation, time-frequency analysis, and sub-band demodulation. These estimation tasks are ill-posed and therefore it is natural to treat them as Bayesian inference problems. The new probabilistic versions of these methods have several advantages. For example, they perform more accurately on natural signals and are more robust to noise, they can also fill-in missing sections of data, and provide error-bars. Furthermore, free-parameters can be learned from the signal. Using these new algorithms we demonstrate that the energy, sparsity, modulation depth and modulation time-scale in each sub-band of a signal are critical statistics, together with the dependencies between the sub-band modulators. In order to validate this claim, a model containing co-modulated coloured noise carriers is shown to be capable of generating a range of realistic sounding auditory textures. Finally, we explored the connection between the statistics of natural sounds and perception. We demonstrate that inference in the model for auditory textures qualitatively replicates the primitive grouping rules that listeners use to understand simple acoustic scenes. This suggests that the auditory system is optimised for the statistics of natural sounds.

Pietro Berkes, Richard E. Turner, and Maneesh Sahani. A structured model of video reproduces primary visual cortical organisation. PLoS Computational Biology, 5(9), 09 2009, doi 10.1371/journal.pcbi.1000495.

Abstract: The visual system must learn to infer the presence of objects and features in the world from the images it encounters, and as such it must, either implicitly or explicitly, model the way these elements interact to create the image. Do the response properties of cells in the mammalian visual system reflect this constraint? To address this question, we constructed a probabilistic model in which the identity and attributes of simple visual elements were represented explicitly and learnt the parameters of this model from unparsed, natural video sequences. After learning, the behaviour and grouping of variables in the probabilistic model corresponded closely to functional and anatomical properties of simple and complex cells in the primary visual cortex (V1). In particular, feature identity variables were activated in a way that resembled the activity of complex cells, while feature attribute variables responded much like simple cells. Furthermore, the grouping of the attributes within the model closely parallelled the reported anatomical grouping of simple cells in cat V1. Thus, this generative model makes explicit an interpretation of complex and simple cells as elements in the segmentation of a visual scene into basic independent features, along with a parametrisation of their moment-by-moment appearances. We speculate that such a segmentation may form the initial stage of a hierarchical system that progressively separates the identity and appearance of more articulated visual elements, culminating in view-invariant object recognition.

Jörg Lücke, Richard E. Turner, Maneesh Sahani, and Marc Henniges. Occlusive components analysis. In Y Bengio, D Schuurmans, J Lafferty, C K I Williams, and A Culotta, editors, nips22, pages 1069-1077. mit, 2009.

Abstract: We study unsupervised learning in a probabilistic generative model for occlusion. The model uses two types of latent variables: one indicates which objects are present in the image, and the other how they are ordered in depth. This depth order then determines how the positions and appearances of the objects present, specified in the model parameters, combine to form the image. We show that the object parameters can be learnt from an unlabelled set of images in which objects occlude one another. Exact maximum-likelihood learning is intractable. However, we show that tractable approximations to Expectation Maximization (EM) can be found if the training images each contain only a small number of objects on average. In numerical experiments it is shown that these approximations recover the correct set of object parameters. Experiments on a novel version of the bars test using colored bars, and experiments on more realistic data, show that the algorithm performs well in extracting the generating causes. Experiments based on the standard bars benchmark test for object learning show that the algorithm performs well in comparison to other recent component extraction approaches. The model and the learning algorithm thus connect research on occlusion with the research field of multiple-causes component extraction methods.

Pietro Berkes, Richard E. Turner, and Maneesh Sahani. On sparsity and overcompleteness in image models. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, nips20, volume 20. mit, 2008.

Abstract: Computational models of visual cortex, and in particular those based on sparse coding, have enjoyed much recent attention. Despite this currency, the question of how sparse or how over-complete a sparse representation should be, has gone without principled answer. Here, we use Bayesian model-selection methods to address these questions for a sparse-coding model based on a Student-t prior. Having validated our methods on toy data, we find that natural images are indeed best modelled by extremely sparse distributions; although for the Student-t prior, the associated optimal basis size is only modestly over-complete.

Richard E. Turner and Maneesh Sahani. Modeling natural sounds with modulation cascade processes. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, nips20, volume 20. mit, 2008.

Abstract: Natural sounds are structured on many time-scales. A typical segment of speech, for example, contains features that span four orders of magnitude: Sentences ($\sim1$s); phonemes ($\sim10$−$1$ s); glottal pulses ($\sim 10$−$2$s); and formants ($\sim 10$−$3$s). The auditory system uses information from each of these time-scales to solve complicated tasks such as auditory scene analysis [1]. One route toward understanding how auditory processing accomplishes this analysis is to build neuroscience-inspired algorithms which solve similar tasks and to compare the properties of these algorithms with properties of auditory processing. There is however a discord: Current machine-audition algorithms largely concentrate on the shorter time-scale structures in sounds, and the longer structures are ignored. The reason for this is two-fold. Firstly, it is a difficult technical problem to construct an algorithm that utilises both sorts of information. Secondly, it is computationally demanding to simultaneously process data both at high resolution (to extract short temporal information) and for long duration (to extract long temporal information). The contribution of this work is to develop a new statistical model for natural sounds that captures structure across a wide range of time-scales, and to provide efficient learning and inference algorithms. We demonstrate the success of this approach on a missing data task.

Richard E. Turner and M Sahani. Probabilistic amplitude demodulation. In 7th International Conference on Independent Component Analysis and Signal Separation, pages 544-551, 2007.

Abstract: Auditory scene analysis is extremely challenging. One approach, perhaps that adopted by the brain, is to shape useful representations of sounds on prior knowledge about their statistical structure. For example, sounds with harmonic sections are common and so time-frequency representations are efficient. Most current representations concentrate on the shorter components. Here, we propose representations for structures on longer time-scales, like the phonemes and sentences of speech. We decompose a sound into a product of processes, each with its own characteristic time-scale. This demodulation cascade relates to classical amplitude demodulation, but traditional algorithms fail to realise the representation fully. A new approach, probabilistic amplitude demodulation, is shown to out-perform the established methods, and to easily extend to representation of a full demodulation cascade.

Richard E. Turner and Maneesh Sahani. A maximum-likelihood interpretation for slow feature analysis. nc, 19(4):1022-1038, 2007, doi http://dx.doi.org/10.1162/neco.2007.19.4.1022.

Abstract: The brain extracts useful features from a maelstrom of sensory information, and a fundamental goal of theoretical neuroscience is to work out how it does so. One proposed feature extraction strategy is motivated by the observation that the meaning of sensory data, such as the identity of a moving visual object, is often more persistent than the activation of any single sensory receptor. This notion is embodied in the slow feature analysis (SFA) algorithm, which uses “slowness” as an heuristic by which to extract semantic information from multi-dimensional time-series. Here, we develop a probabilistic interpretation of this algorithm showing that inference and learning in the limiting case of a suitable probabilistic model yield exactly the results of SFA. Similar equivalences have proved useful in interpreting and extending comparable algorithms such as independent component analysis. For SFA, we use the equivalent probabilistic model as a conceptual spring-board, with which to motivate several novel extensions to the algorithm.


Ryan Turner

David Duvenaud, Oren Rippel, Ryan P. Adams, and Zoubin Ghahramani. Avoiding pathologies in very deep networks. In 17th International Conference on Artificial Intelligence and Statistics, Reykjavik, Iceland, April 2014.

Abstract: Choosing appropriate architectures and regularization strategies for deep networks is crucial to good predictive performance. To shed light on this problem, we analyze the analogous problem of constructing useful priors on compositions of functions. Specifically, we study the deep Gaussian process, a type of infinitely-wide, deep neural network. We show that in standard architectures, the representational capacity of the network tends to capture fewer degrees of freedom as the number of layers increases, retaining only a single degree of freedom in the limit. We propose an alternate network architecture which does not suffer from this pathology. We also examine deep covariance functions, obtained by composing infinitely many feature transforms. Lastly, we characterize the class of models obtained by performing dropout on Gaussian processes.

Andrew Gordon Wilson and Ryan Prescott Adams. Gaussian process kernels for pattern discovery and extrapolation. In 30th International Conference on Machine Learning, February 18 2013.

Abstract: Gaussian processes are rich distributions over functions, which provide a Bayesian nonparametric approach to smoothing and interpolation. We introduce simple closed form kernels that can be used with Gaussian processes to discover patterns and enable extrapolation. These kernels are derived by modelling a spectral density - the Fourier transform of a kernel - with a Gaussian mixture. The proposed kernels support a broad class of stationary covariances, but Gaussian process inference remains simple and analytic. We demonstrate the proposed kernels by discovering patterns and performing long range extrapolation on synthetic examples, as well as atmospheric CO2 trends and airline passenger data. We also show that we can reconstruct standard covariances within our framework.

Comment: arXiv:1302.4245

Marc Peter Deisenroth, Ryan D. Turner, Marco F. Huber, Uwe D. Hanebeck, and Carl Edward Rasmussen. Robust filtering and smoothing with Gaussian processes. IEEE Transactions on Automatic Control, 57(7):1865-1871, 2012.

Abstract: We propose a principled algorithm for robust Bayesian filtering and smoothing in nonlinear stochastic dynamic systems when both the transition function and the measurement function are described by nonparametric Gaussian process (GP) models. GPs are gaining increasing importance in signal processing, machine learning, robotics, and control for representing unknown system functions by posterior probability distributions. This modern way of "system identification" is more robust than finding point estimates of a parametric function representation. Our principled filtering/smoothing approach for GP dynamic systems is based on analytic moment matching in the context of the forward-backward algorithm. Our numerical evaluations demonstrate the robustness of the proposed approach in situations where other state-of-the-art Gaussian filters and smoothers can fail.

Ryan D. Turner and Carl Edward Rasmussen. Model based learning of sigma points in unscented Kalman filtering. Neurocomputing, 80:47-53, 2012, doi 10.1016/j.neucom.2011.07.029.

Abstract: The unscented Kalman filter (UKF) is a widely used method in control and time series applications. The UKF suffers from arbitrary parameters necessary for sigma point placement, potentially causing it to perform poorly in nonlinear problems. We show how to treat sigma point placement in a UKF as a learning problem in a model based view. We demonstrate that learning to place the sigma points correctly from data can make sigma point collapse much less likely. Learning can result in a significant increase in predictive performance over default settings of the parameters in the UKF and other filters designed to avoid the problems of the UKF, such as the GP-ADF. At the same time, we maintain a lower computational complexity than the other methods. We call our method UKF-L.

Ryan Darby Turner. Gaussian processes for state space models and change point detection. PhD thesis, University of Cambridge, Department of Engineering, Cambridge, UK, 2011.

Abstract: This thesis details several applications of Gaussian processes (GPs) for enhanced time series modeling. We first cover different approaches for using Gaussian processes in time series problems. These are extended to the state space approach to time series in two different problems. We also combine Gaussian processes and Bayesian online change point detection (BOCPD) to increase the generality of the Gaussian process time series methods. These methodologies are evaluated on predictive performance on six real world data sets, which include three environmental data sets, one financial, one biological, and one from industrial well drilling.
Gaussian processes are capable of generalizing standard linear time series models. We cover two approaches: the Gaussian process time series model (GPTS) and the autoregressive Gaussian process (ARGP). We cover a variety of methods that greatly reduce the computational and memory complexity of Gaussian process approaches, which are generally cubic in computational complexity.
Two different improvements to state space based approaches are covered. First, Gaussian process inference and learning (GPIL) generalizes linear dynamical systems (LDS), for which the Kalman filter is based, to general nonlinear systems for nonparametric system identification. Second, we address pathologies in the unscented Kalman filter (UKF). We use Gaussian process optimization (GPO) to learn UKF settings that minimize the potential for sigma point collapse.
We show how to embed mentioned Gaussian process approaches to time series into a change point framework. Old data, from an old regime, that hinders predictive performance is automatically and elegantly phased out. The computational improvements for Gaussian process time series approaches are of even greater use in the change point framework. We also present a supervised framework learning a change point model when change point labels are available in training.
These mentioned methodologies significantly improve predictive performance on the diverse set of data sets selected.

Ryan Turner, Steven Bottone, and Zoubin Ghahramani. Fast online anomaly detection using scan statistics. In Samuel Kaski, David J. Miller, Erkki Oja, and Antti Honkela, editors, Machine Learning for Signal Processing (MLSP 2010), pages 385-390, Kittilä, Finland, August 2010.

Abstract: We present methods to do fast online anomaly detection using scan statistics. Scan statistics have long been used to detect statistically significant bursts of events. We extend the scan statistics framework to handle many practical issues that occur in application: dealing with an unknown background rate of events, allowing for slow natural changes in background frequency, the inverse problem of finding an unusual lack of events, and setting the test parameters to maximize power. We demonstrate its use on real and synthetic data sets with comparison to other methods.

Ryan Turner and Carl Edward Rasmussen. Model based learning of sigma points in unscented Kalman filtering. In Samuel Kaski, David J. Miller, Erkki Oja, and Antti Honkela, editors, Machine Learning for Signal Processing (MLSP 2010), pages 178-183, Kittilä, Finland, August 2010.

Abstract: The unscented Kalman filter (UKF) is a widely used method in control and time series applications. The UKF suffers from arbitrary parameters necessary for a step known as sigma point placement, causing it to perform poorly in nonlinear problems. We show how to treat sigma point placement in a UKF as a learning problem in a model based view. We demonstrate that learning to place the sigma points correctly from data can make sigma point collapse much less likely. Learning can result in a significant increase in predictive performance over default settings of the parameters in the UKF and other filters designed to avoid the problems of the UKF, such as the GP-ADF. At the same time, we maintain a lower computational complexity than the other methods. We call our method UKF-L.

Yunus Saatçi, Ryan Turner, and Carl Edward Rasmussen. Gaussian process change point models. In 27th International Conference on Machine Learning, pages 927-934, Haifa, Israel, June 2010.

Abstract: We combine Bayesian online change point detection with Gaussian processes to create a nonparametric time series model which can handle change points. The model can be used to locate change points in an online manner; and, unlike other Bayesian online change point detection algorithms, is applicable when temporal correlations in a regime are expected. We show three variations on how to apply Gaussian processes in the change point context, each with their own advantages. We present methods to reduce the computational burden of these models and demonstrate it on several real world data sets.

Comment: poster, slides.

Ryan Turner, Marc Peter Deisenroth, and Carl Edward Rasmussen. State-space inference and learning with Gaussian processes. In Yee Whye Teh and Mike Titterington, editors, 13th International Conference on Artificial Intelligence and Statistics, volume 9 of W & CP, pages 868-875, Chia Laguna, Sardinia, Italy, May 13-15 2010. Journal of Machine Learning Research.

Abstract: State-space inference and learning with Gaussian processes (GPs) is an unsolved problem. We propose a new, general methodology for inference and learning in nonlinear state-space models that are described probabilistically by non-parametric GP models. We apply the expectation maximization algorithm to iterate between inference in the latent state-space and learning the parameters of the underlying GP dynamics model.

Comment: poster.

Ryan Turner, Marc Peter Deisenroth, and Carl Edward Rasmussen. System identification in Gaussian process dynamical systems. In Dilan Görür, editor, NIPS Workshop on Nonparametric Bayes, Whistler, BC, Canada, December 2009.

Comment: poster.

Ryan Turner, Yunus Saatçi, and Carl Edward Rasmussen. Adaptive sequential Bayesian change point detection. In Zaïd Harchaoui, editor, NIPS Workshop on Temporal Segmentation, Whistler, BC, Canada, December 2009.

Abstract: Real-world time series are often nonstationary with respect to the parameters of some underlying prediction model (UPM). Furthermore, it is often desirable to adapt the UPM to incoming regime changes as soon as possible, necessitating sequential inference about change point locations. A Bayesian algorithm for online change point detection (BOCPD) has been introduced recently by Adams and MacKay (2007). In this algorithm, uncertainty about the last change point location is updated sequentially, and is integrated out to make online predictions robust to parameter changes. BOCPD requires a set of fixed hyper-parameters which allow the user to fully specify the hazard function for change points and the prior distribution over the parameters of the UPM. In practice, finding the "right" hyper-parameters can be quite difficult. We therefore extend BOCPD by introducing hyper-parameter learning, without sacrificing the online nature of the algorithm. Hyper-parameter learning is performed by optimizing the marginal likelihood of the BOCPD model, a closed-form quantity which can be computed sequentially. We illustrate performance on three real-world datasets.

Comment: poster, slides.


Ed Snelson

Edward Snelson and Zoubin Ghahramani. Local and global sparse Gaussian process approximations. In M. Meila and X. Shen, editors, 11th International Conference on Artificial Intelligence and Statistics. Omnipress, 2007.

Abstract: Gaussian process (GP) models are flexible probabilistic nonparametric models for regression, classification and other tasks. Unfortunately they suffer from computational intractability for large data sets. Over the past decade there have been many different approximations developed to reduce this cost. Most of these can be termed global approximations, in that they try to summarize all the training data via a small set of support points. A different approach is that of local regression, where many local experts account for their own part of space. In this paper we start by investigating the regimes in which these different approaches work well or fail. We then proceed to develop a new sparse GP approximation which is a combination of both the global and local approaches. Theoretically we show that it is derived as a natural extension of the framework developed by Quiñonero-Candela and Rasmussen for sparse GP approximations. We demonstrate the benefits of the combined approximation on some 1D examples for illustration, and on some large real-world data sets.

Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, pages 1257-1264. The MIT Press, Cambridge, MA, 2006.

Abstract: We present a new Gaussian process (GP) regression model whose covariance is parameterized by the the locations of M pseudo-input points, which we learn by a gradient based optimization. We take M<<N, where N is the number of real data points, and hence obtain a sparse regression method which has O(NM2) training cost and O(M2) prediction cost per test case. We also find hyperparameters of the covariance function in the same joint optimization. The method can be viewed as a Bayesian regression model with particular input dependent noise. The method turns out to be closely related to several other sparse GP approaches, and we discuss the relation in detail. We finally demonstrate its performance on some large data sets, and make a direct comparison to other sparse GP methods. We show that our method can match full GP performance with small M, i.e. very sparse solutions, and it significantly outperforms other approaches in this regime.

Edward Snelson and Zoubin Ghahramani. Variable noise and dimensionality reduction for sparse Gaussian processes. In R. Dechter and T. S. Richardson, editors, 22nd Conference on Uncertainty in Artificial Intelligence. AUAI Press, 2006.

Abstract: The sparse pseudo-input Gaussian process (SPGP) is a new approximation method for speeding up GP regression in the case of a large number of data points N. The approximation is controlled by the gradient optimization of a small set of M pseudo-inputs, thereby reducing complexity from O(N3) to O(NM2). One limitation of the SPGP is that this optimization space becomes impractically big for high dimensional data sets. This paper addresses this limitation by performing automatic dimensionality reduction. A projection of the input space to a low dimensional space is learned in a supervised manner, alongside the pseudo-inputs, which now live in this reduced space. The paper also investigates the suitability of the SPGP for modeling data with input-dependent noise. A further extension of the model is made to make it even more powerful in this regard - we learn an uncertainty parameter for each pseudo-input. The combination of sparsity, reduced dimension, and input-dependent noise makes it possible to apply GPs to much larger and more complex data sets than was previously practical. We demonstrate the benefits of these methods on several synthetic and real world problems.

Edward Snelson and Zoubin Ghahramani. Compact approximations to Bayesian predictive distributions. In 22nd International Conference on Machine Learning, Bonn, Germany, August 2005. Omnipress.

Abstract: We provide a general framework for learning precise, compact, and fast representations of the Bayesian predictive distribution for a model. This framework is based on minimizing the KL divergence between the true predictive density and a suitable compact approximation. We consider various methods for doing this, both sampling based approximations, and deterministic approximations such as expectation propagation. These methods are tested on a mixture of Gaussians model for density estimation and on binary linear classification, with both synthetic data sets for visualization and several real data sets. Our results show significant reductions in prediction time and memory footprint.

Edward Snelson, Carl Edward Rasmussen, and Zoubin Ghahramani. Warped Gaussian processes. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16, pages 337-344, Cambridge, MA, USA, December 2004. The MIT Press.

Abstract: We generalise the Gaussian process (GP) framework for regression by learning a nonlinear transformation of the GP outputs. This allows for non-Gaussian processes and non-Gaussian noise. The learning algorithm chooses a nonlinear transformation such that transformed data is well-modelled by a GP. This can be seen as including a preprocessing transformation as an integral part of the probabilistic modelling problem, rather than as an ad-hoc step. We demonstrate on several real regression problems that learning the transformation can lead to significantly better performance than using a regular GP, or a GP with a fixed transformation.

Sinead Williamson

Sinead Williamson, Katherine A. Heller, C. Wang, and D. M. Blei. The IBP compound Dirichlet process and its application to focused topic modeling. In 27th International Conference on Machine Learning, pages 1151-1158, Haifa, Israel, June 2010.

Abstract: The hierarchical Dirichlet process (HDP) is a Bayesian nonparametric mixed membership model - each data point is modeled with a collection of components of different proportions. Though powerful, the HDP makes an assumption that the probability of a component being exhibited by a data point is positively correlated with its proportion within that data point. This might be an undesirable assumption. For example, in topic modeling, a topic (component) might be rare throughout the corpus but dominant within those documents (data points) where it occurs. We develop the IBP compound Dirichlet process (ICD), a Bayesian nonparametric prior that decouples across-data prevalence and within-data proportion in a mixed membership model. The ICD combines properties from the HDP and the Indian buffet process (IBP), a Bayesian nonparametric prior on binary matrices. The ICD assigns a subset of the shared mixture components to each data point. This subset, the data point's "focus", is determined independently from the amount that each of its components contribute. We develop an ICD mixture model for text, the focused topic model (FTM), and show superior performance over the HDP-based topic model.

Sinead Williamson, Peter Orbanz, and Zoubin Ghahramani. Dependent Indian buffet processes. In 13th International Conference on Artificial Intelligence and Statistics, volume 9 of W & CP, pages 924-931, Chia Laguna, Sardinia, Italy, May 2010.

Abstract: Latent variable models represent hidden structure in observational data. To account for the distribution of the observational data changing over time, space or some other covariate, we need generalizations of latent variable models that explicitly capture this dependency on the covariate. A variety of such generalizations has been proposed for latent variable models based on the Dirichlet process. We address dependency on covariates in binary latent feature models, by introducing a dependent Indian Buffet Process. The model generates a binary random matrix with an unbounded number of columns for each value of the covariate. Evolution of the binary matrices over the covariate set is controlled by a hierarchical Gaussian process model. The choice of covariance functions controls the dependence structure and exchangeability properties of the model. We derive a Markov Chain Monte Carlo sampling algorithm for Bayesian inference, and provide experiments on both synthetic and real-world data. The experimental results show that explicit modeling of dependencies significantly improves accuracy of predictions.

Katherine A. Heller, Sinead Williamson, and Zoubin Ghahramani. Statistical models for partial membership. In Andrew McCallum and Sam Roweis, editors, 25th International Conference on Machine Learning, pages 392-399, Helsinki, Finland, July 2008. Omnipress.

Abstract: We present a principled Bayesian framework for modeling partial memberships of data points to clusters. Unlike a standard mixture model which assumes that each data point belongs to one and only one mixture component, or cluster, a partial membership model allows data points to have fractional membership in multiple clusters. Algorithms which assign data points partial memberships to clusters can be useful for tasks such as clustering genes based on microarray data (Gasch & Eisen, 2002). Our Bayesian Partial Membership Model (BPM) uses exponential family distributions to model each cluster, and a product of these distibtutions, with weighted parameters, to model each datapoint. Here the weights correspond to the degree to which the datapoint belongs to each cluster. All parameters in the BPM are continuous, so we can use Hybrid Monte Carlo to perform inference and learning. We discuss relationships between the BPM and Latent Dirichlet Allocation, Mixed Membership models, Exponential Family PCA, and fuzzy clustering. Lastly, we show some experimental results and discuss nonparametric extensions to our model.

Sinead Williamson and Zoubin Ghahramani. Probabilistic models for data combination in recommender systems. In Learning from Multiple Sources Workshop, NIPS Conference, Whistler Canada, 2008.

Andrew Gordon Wilson

Amar Shah, Andrew Gordon Wilson, and Zoubin Ghahramani. Student-t processes as alternatives to Gaussian processes. In AISTATS, JMLR Proceedings. JMLR.org, 2014.

Abstract: We investigate the Student-t process as an alternative to the Gaussian process as a nonparametric prior over functions. We derive closed form expressions for the marginal likelihood and predictive distribution of a Student-t process, by integrating away an inverse Wishart process prior over the covariance kernel of a Gaussian process model. We show surprising equivalences between different hierarchical Gaussian process models leading to Student-t processes, and derive a new sampling scheme for the inverse Wishart process, which helps elucidate these equivalences. Overall, we show that a Student-t process can retain the attractive properties of a Gaussian process - a nonparametric representation, analytic marginal and predictive distributions, and easy model selection through covariance kernels - but has enhanced flexibility, and predictive covariances that, unlike a Gaussian process, explicitly depend on the values of training observations. We verify empirically that a Student-t process is especially useful in situations where there are changes in covariance structure, or in applications like Bayesian optimization, where accurate predictive covariances are critical for good performance. These advantages come at no additional computational cost over Gaussian processes.

Andrew Gordon Wilson. Covariance Kernels for Fast Automatic Pattern Discovery and Extrapolation with Gaussian Processes. PhD thesis, University of Cambridge, Cambridge, UK, 2014.

Abstract: Truly intelligent systems are capable of pattern discovery and extrapolation without human intervention. Bayesian nonparametric models, which can uniquely represent expressive prior information and detailed inductive biases, provide a distinct opportunity to develop intelligent systems, with applications in essentially any learning and prediction task.
Gaussian processes are rich distributions over functions, which provide a Bayesian nonparametric approach to smoothing and interpolation. A covariance kernel determines the support and inductive biases of a Gaussian process. In this thesis, we introduce new covariance kernels to enable fast automatic pattern discovery and extrapolation with Gaussian processes.
In the introductory chapter, we discuss the high level principles behind all of the models in this thesis: 1) we can typically improve the predictive performance of a model by accounting for additional structure in data; 2) to automatically discover rich structure in data, a model must have large support and the appropriate inductive biases; 3) we most need expressive models for large datasets, which typically provide more information for learning structure, and 4) we can often exploit the existing inductive biases (assumptions) or structure of a model for scalable inference, without the need for simplifying assumptions.
In the context of this introduction, we then discuss, in chapter 2, Gaussian processes as kernel machines, and my views on the future of Gaussian process research.
In chapter 3 we introduce the Gaussian process regression network (GPRN) framework, a multi-output Gaussian process method which scales to many output variables, and accounts for input-dependent correlations between the outputs. Underlying the GPRN is a highly expressive kernel, formed using an adaptive mixture of latent basis functions in a neural network like architecture. The GPRN is capable of discovering expressive structure in data. We use the GPRN to model the time-varying expression levels of 1000 genes, the spatially varying concentrations of several distinct heavy metals, and multivariate volatility (input dependent noise covariances) between returns on equity indices and currency exchanges, which is particularly valuable for portfolio allocation. We generalise the GPRN to an adaptive network framework, which does not depend on Gaussian processes or Bayesian nonparametrics; and we outline applications for the adaptive network in nuclear magnetic resonance (NMR) spectroscopy, ensemble learning, and change-point modelling.
In chapter 4 we introduce simple closed form kernel for automatic pattern discovery and extrapolation. These spectral mixture (SM) kernels are derived by modelling the spectral densiy of a kernel (its Fourier transform) using a scale-location Gaussian mixture. SM kernels form a basis for all stationary covariances, and can be used as a drop-in replacement for standard kernels, as they retain simple and exact learning and inference procedures. We use the SM kernel to discover patterns and perform long range extrapolation on atmospheric CO2 trends and airline passenger data, as well as on synthetic examples. We also show that the SM kernel can be used to automatically reconstruct several standard covariances. The SM kernel and the GPRN are highly complementary; we show that using the SM kernel with adaptive basis functions in a GPRN induces an expressive prior over non-stationary kernels.
In chapter 5 we introduce GPatt, a method for fast multidimensional pattern extrapolation, particularly suited to imge and movie data. Without human intervention - no hand crafting of kernel features, and no sophisticated initialisation procedures - we show that GPatt can solve large scale pattern extrapolation, inpainting and kernel discovery problems, including a problem with 383,400 training points. GPatt exploits the structure of a spectral mixture product (SMP) kernel, for fast yet exact inference procedures. We find that GPatt significantly outperforms popular alternative scalable gaussian process methods in speed and accuracy. Moreover, we discover profound differences between each of these methods, suggesting expressive kernels, nonparametric representations, and scalable inference which exploits existing model structure are useful in combination for modelling large scale multidimensional patterns.
The models in this dissertation have proven to be scalable and with greatly enhanced predictive performance over the alternatives: the extra structure being modelled is an important part of a wide variety of real data - including problems in econometrics, gene expression, geostatistics, nuclear magnetic resonance spectroscopy, ensemble learning, multi-output regression, change point modelling, time series, multivariate volatility, image inpainting, texture extrapolation, video extrapolation, acoustic modelling, and kernel discovery.

Andrew Gordon Wilson, Yuting Wu, Daniel J. Holland, Sebastian Nowozin, Mick D. Mantle, Lynn F. Gladden, and Andrew Blake. Bayesian inference for NMR spectroscopy with applications to chemical quantification. arXiv preprint arXiv 1402.3580, 2014.

Abstract: Nuclear magnetic resonance (NMR) spectroscopy exploits the magnetic properties of atomic nuclei to discover the structure, reaction state and chemical environment of molecules. We propose a probabilistic generative model and inference procedures for NMR spectroscopy. Specifically, we use a weighted sum of trigonometric functions undergoing exponential decay to model free induction decay (FID) signals. We discuss the challenges in estimating the components of this general model - amplitudes, phase shifts, frequencies, decay rates, and noise variances - and offer practical solutions. We compare with conventional Fourier transform spectroscopy for estimating the relative concentrations of chemicals in a mixture, using synthetic and experimentally acquired FID signals. We find the proposed model is particularly robust to low signal to noise ratios (SNR), and overlapping peaks in the Fourier transform of the FID, enabling accurate predictions (e.g., 1% error at low SNR) which are not possible with conventional spectroscopy (5% error).

Andrew Gordon Wilson and Ryan Prescott Adams. Gaussian process kernels for pattern discovery and extrapolation. In 30th International Conference on Machine Learning, February 18 2013.

Abstract: Gaussian processes are rich distributions over functions, which provide a Bayesian nonparametric approach to smoothing and interpolation. We introduce simple closed form kernels that can be used with Gaussian processes to discover patterns and enable extrapolation. These kernels are derived by modelling a spectral density - the Fourier transform of a kernel - with a Gaussian mixture. The proposed kernels support a broad class of stationary covariances, but Gaussian process inference remains simple and analytic. We demonstrate the proposed kernels by discovering patterns and performing long range extrapolation on synthetic examples, as well as atmospheric CO2 trends and airline passenger data. We also show that we can reconstruct standard covariances within our framework.

Comment: arXiv:1302.4245

Andrew Gordon Wilson, Elad Gilboa, Arye Nehorai, and John P Cunningham. Gpatt: Fast multidimensional pattern extrapolation with Gaussian processes. arXiv preprint arXiv:1310.5288, 2013.

Abstract: Gaussian processes are typically used for smoothing and interpolation on small datasets. We introduce a new Bayesian nonparametric framework - GPatt - enabling automatic pattern extrapolation with Gaussian processes on large multidimensional datasets. GPatt unifies and extends highly expressive kernels and fast exact inference techniques. Without human intervention - no hand crafting of kernel features, and no sophisticated initialisation procedures - we show that GPatt can solve large scale pattern extrapolation, inpainting, and kernel discovery problems, including a problem with 383,400 training points. We find that GPatt significantly outperforms popular alternative scalable Gaussian process methods in speed and accuracy. Moreover, we discover profound differences between each of these methods, suggesting expressive kernels, nonparametric representations, and scalable inference which exploits model structure are useful in combination for modelling large scale multidimensional patterns.

Andrew Gordon Wilson, David A. Knowles, and Zoubin Ghahramani. Gaussian process regression networks. In 29th International Conference on Machine Learning, Edinburgh, Scotland, June 2012.

Abstract: We introduce a new regression framework, Gaussian process regression networks (GPRN), which combines the structural properties of Bayesian neural networks with the nonparametric flexibility of Gaussian processes. GPRN accommodates input (predictor) dependent signal and noise correlations between multiple output (response) variables, input dependent length-scales and amplitudes, and heavy-tailed predictive distributions. We derive both elliptical slice sampling and variational Bayes inference procedures for GPRN. We apply GPRN as a multiple output regression and multivariate volatility model, demonstrating substantially improved performance over eight popular multiple output (multi-task) Gaussian process models and three multivariate volatility models on real datasets, including a 1000 dimensional gene expression dataset.

Andrew Gordon Wilson and Zoubin Ghahramani. Modelling input varying correlations between multiple responses. In Peter A. Flach, Tijl De Bie, and Nello Cristianini, editors, ECML/PKDD, volume 7524 of Lecture Notes in Computer Science, pages 858-861. Springer, 2012.

Abstract: We introduced a generalised Wishart process (GWP) for modelling input dependent covariance matrices Σ(x), allowing one to model input varying correlations and uncertainties between multiple response variables. The GWP can naturally scale to thousands of response variables, as opposed to competing multivariate volatility models which are typically intractable for greater than 5 response variables. The GWP can also naturally capture a rich class of covariance dynamics - periodicity, Brownian motion, smoothness, - through a covariance kernel.

Andrew Gordon Wilson, David A Knowles, and Zoubin Ghahramani. Gaussian process regression networks. Technical Report arXiv:1110.4411 [stat.ML], Department of Engineering, University of Cambridge, Cambridge, UK, October 19 2011.

Abstract: We introduce a new regression framework, Gaussian process regression networks (GPRN), which combines the structural properties of Bayesian neural networks with the non-parametric flexibility of Gaussian processes. This model accommodates input dependent signal and noise correlations between multiple response variables, input dependent length-scales and amplitudes, and heavy-tailed predictive distributions. We derive both efficient Markov chain Monte Carlo and variational Bayes inference procedures for this model. We apply GPRN as a multiple output regression and multivariate volatility model, demonstrating substantially improved performance over eight popular multiple output (multi-task) Gaussian process models and three multivariate volatility models on benchmark datasets, including a 1000 dimensional gene expression dataset.

Comment: arXiv:1110.4411

Andrew Gordon Wilson and Zoubin Ghahramani. Generalised Wishart processes. In 27th Conference on Uncertainty in Artificial Intelligence, 2011.

Abstract: We introduce a new stochastic process called the generalised Wishart process (GWP). It is a collection of positive semi-definite random matrices indexed by any arbitrary input variable. We use this process as a prior over dynamic (e.g. time varying) covariance matrices. The GWP captures a diverse class of covariance dynamics, naturally hanles missing data, scales nicely with dimension, has easily interpretable parameters, and can use input variables that include covariates other than time. We describe how to construct the GWP, introduce general procedures for inference and prediction, and show that it outperforms its main competitor, multivariate GARCH, even on financial data that especially suits GARCH.

Comment: Supplementary Material, Best Student Paper Award

Andrew Gordon Wilson and Zoubin Ghahramani. Copula processes. In Advances in Neural Information Processing Systems 23, 2010. Spotlight.

Abstract: We define a copula process which describes the dependencies between arbitrarily many random variables independently of their marginal distributions. As an example, we develop a stochastic volatility model, Gaussian Copula Process Volatility (GCPV), to predict the latent standard deviations of a sequence of random variables. To make predictions we use Bayesian inference, with the Laplace approximation, and with Markov chain Monte Carlo as an alternative. We find our model can outperform GARCH on simulated and financial data. And unlike GARCH, GCPV can easily handle missing data, incorporate covariates other than time, and model a rich class of covariance structures.

Comment: Supplementary Material, slides.