# Non-parametric Bayesian Methods

Non-parametric models are very flexible statistical models in which the complexity of the model grows with the amount of observed data. While traditional parametric models make strong assumptions about how the data were generated, non-parametric models make weaker assumptions and let the data "speak for themselves". Many non-parametric models can be seen as infinite limits of finite parametric models, and an important family of non-parametric models is derived from Dirichlet processes. See also Gaussian Processes.
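
To make the "infinite limit" view concrete, here is a minimal sketch (ours, not code from any of the papers below) of the truncated stick-breaking construction of a Dirichlet process; the truncation level `K` is a practical stand-in for the infinitely many components:

```python
import numpy as np

def stick_breaking_dp(alpha, base_sampler, K=1000, rng=None):
    """Truncated stick-breaking draw from DP(alpha, H).

    Returns atoms and weights of an (approximately) random probability
    measure G = sum_k w_k * delta_{theta_k}.
    """
    rng = np.random.default_rng(rng)
    betas = rng.beta(1.0, alpha, size=K)                        # stick proportions
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
    weights = betas * remaining                                 # w_k = beta_k * prod_{j<k}(1 - beta_j)
    atoms = base_sampler(K, rng)                                # theta_k ~ H, i.i.d.
    return atoms, weights

# Example with base measure H = N(0, 1); weights sum to ~1 for large K.
atoms, weights = stick_breaking_dp(alpha=2.0,
                                   base_sampler=lambda k, rng: rng.normal(size=k))
print(weights[:5], weights.sum())
```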

#### Archipelago: nonparametric Bayesian semi-supervised learning

R. Adams, Zoubin Ghahramani, June 2009. (In 26th International Conference on Machine Learning). Edited by Léon Bottou, Michael Littman. Montréal, QC, Canada. Omnipress.

Semi-supervised learning (SSL) is classification where additional unlabeled data can be used to improve accuracy. Generative approaches are appealing in this situation, as a model of the data’s probability density can assist in identifying clusters. Nonparametric Bayesian methods, while ideal in theory due to their principled motivations, have been difficult to apply to SSL in practice. We present a nonparametric Bayesian method that uses Gaussian processes for the generative model, avoiding many of the problems associated with Dirichlet process mixture models. Our model is fully generative and we take advantage of recent advances in Markov chain Monte Carlo algorithms to provide a practical inference method. Our method compares favorably to competing approaches on synthetic and real-world multi-class data.

**Comment:** This paper was awarded Honourable Mention for Best Paper at ICML 2009.

#### Tree-Structured Stick Breaking for Hierarchical Data

R. P. Adams, Zoubin Ghahramani, Michael I. Jordan, 2010. (In Advances in Neural Information Processing Systems 23). The MIT Press.

Many data are naturally modeled by an unobserved hierarchical structure. In this paper we propose a flexible nonparametric prior over unknown data hierarchies. The approach uses nested stick-breaking processes to allow for trees of unbounded width and depth, where data can live at any node and are infinitely exchangeable. One can view our model as providing infinite mixtures where the components have a dependency structure corresponding to an evolutionary diffusion down a tree. By using a stick-breaking approach, we can apply Markov chain Monte Carlo methods based on slice sampling to perform Bayesian inference and simulate from the posterior distribution on trees. We apply our method to hierarchical clustering of images and topic modeling of text data.
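
As a rough illustration of nested stick-breaking, the sketch below (ours) samples node assignments from a simplified tree-structured stick-breaking prior. It uses fixed stopping and branching concentrations `alpha` and `gamma`, whereas the paper allows depth-dependent parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, gamma = 1.0, 1.0        # stopping / branching concentrations (simplified)
node_params = {}               # lazily instantiated per-node quantities

def get_node(path):
    # Each node has a stopping probability nu and stick weights over its children.
    if path not in node_params:
        node_params[path] = {"nu": rng.beta(1.0, alpha), "sticks": []}
    return node_params[path]

def sample_branch(node):
    # GEM(gamma) stick-breaking over a potentially unbounded set of children.
    u, j, acc = rng.random(), 0, 0.0
    while True:
        if j == len(node["sticks"]):
            node["sticks"].append(rng.beta(1.0, gamma))
        remaining = np.prod([1.0 - s for s in node["sticks"][:j]])
        acc += node["sticks"][j] * remaining
        if u < acc:
            return j
        j += 1

def assign(path=()):
    # Walk down the (unbounded) tree: stop at this node, or recurse into a child.
    node = get_node(path)
    if rng.random() < node["nu"]:
        return path
    return assign(path + (sample_branch(node),))

print([assign() for _ in range(10)])   # node assignments, e.g. (), (0,), (0, 2), ...
```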

#### Learning the Structure of Deep Sparse Graphical Models

R. P. Adams, H. Wallach, Zoubin Ghahramani, May 2010. (In 13th International Conference on Artificial Intelligence and Statistics). Edited by Yee Whye Teh, Mike Titterington. Chia Laguna, Sardinia, Italy.

Deep belief networks are a powerful way to model complex probability distributions. However, it is difficult to learn the structure of a belief network, particularly one with hidden units. The Indian buffet process has been used as a nonparametric Bayesian prior on the structure of a directed belief network with a single infinitely wide hidden layer. Here, we introduce the cascading Indian buffet process (CIBP), which provides a prior on the structure of a layered, directed belief network that is unbounded in both depth and width, yet allows tractable inference. We use the CIBP prior with the nonlinear Gaussian belief network framework to allow each unit to vary its behavior between discrete and continuous representations. We use Markov chain Monte Carlo for inference in this model and explore the structures learned on image data.

**Comment:** Winner of the Best Paper Award

#### The Mondrian Kernel

Matej Balog, Balaji Lakshminarayanan, Zoubin Ghahramani, Daniel M. Roy, Yee Whye Teh, June 2016. (In 32nd Conference on Uncertainty in Artificial Intelligence). Jersey City, New Jersey, USA.

We introduce the Mondrian kernel, a fast random feature approximation to the Laplace kernel. It is suitable for both batch and online learning, and admits a fast kernel-width-selection procedure as the random features can be re-used efficiently for all kernel widths. The features are constructed by sampling trees via a Mondrian process [Roy and Teh, 2009], and we highlight the connection to Mondrian forests [Lakshminarayanan et al., 2014], where trees are also sampled via a Mondrian process, but fit independently. This link provides a new insight into the relationship between kernel methods and random forests.
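
The following is an illustrative Monte Carlo sketch (ours, not the authors' code) of the underlying connection: two points fall in the same cell of a Mondrian sample with lifetime `lifetime` with probability approximating the Laplace kernel exp(-lifetime · ||x - x'||₁). For simplicity, cuts are restricted to the data's bounding box:

```python
import numpy as np

rng = np.random.default_rng(0)

def mondrian_cell_ids(X, lifetime, rng):
    """Assign each row of X a cell id under one Mondrian sample with budget `lifetime`."""
    ids = np.zeros(len(X), dtype=int)
    next_id = [0]

    def split(idx, lo, hi, budget):
        lengths = hi - lo
        rate = lengths.sum()
        t = rng.exponential(1.0 / rate) if rate > 0 else np.inf
        if t > budget or len(idx) == 0:            # no cut arrives in time: leaf cell
            ids[idx] = next_id[0]
            next_id[0] += 1
            return
        d = rng.choice(len(lo), p=lengths / rate)  # cut dimension prop. to extent
        cut = rng.uniform(lo[d], hi[d])            # cut location uniform on the side
        lo_r, hi_l = lo.copy(), hi.copy()
        hi_l[d] = cut
        lo_r[d] = cut
        split(idx[X[idx, d] <= cut], lo, hi_l, budget - t)
        split(idx[X[idx, d] > cut], lo_r, hi, budget - t)

    split(np.arange(len(X)), X.min(0).astype(float), X.max(0).astype(float), lifetime)
    return ids

def mondrian_kernel(X, lifetime=2.0, n_trees=200):
    """K[i, j] = fraction of Mondrian samples in which x_i and x_j share a cell."""
    K = np.zeros((len(X), len(X)))
    for _ in range(n_trees):
        ids = mondrian_cell_ids(X, lifetime, rng)
        K += ids[:, None] == ids[None, :]
    return K / n_trees

X = rng.uniform(0, 1, size=(5, 2))
print(np.round(mondrian_kernel(X), 2))   # ~ exp(-lifetime * ||x - x'||_1)
```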

#### Sampling and inference for discrete random probability measures in probabilistic programs

Ben Bloem-Reddy, Emile Mathieu, Adam Foster, Tom Rainforth, Hong Ge, Maria Lomeli, Zoubin Ghahramani, December 2017. (In NIPS workshop on Advances in Approximate Inference). California, United States.

We consider the problem of sampling a sequence from a discrete random probability measure (RPM) with countable support, under (probabilistic) constraints of finite memory and computation. A canonical example is sampling from the Dirichlet Process, which can be accomplished using its well-known stick-breaking representation and lazy initialization of its atoms. We show that efficient lazy initialization is possible if and only if a size-biased representation of the discrete RPM is known. For models constructed from such discrete RPMs, we consider the implications for generic particle-based inference methods in probabilistic programming systems. To demonstrate, we implement posterior inference for Normalized Inverse Gaussian Process mixture models in Turing.
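
A minimal sketch (ours) of the lazy-initialization idea for the Dirichlet process, the canonical example above: sticks and atoms are instantiated only when a draw actually needs them, so memory grows with the number of atoms visited rather than with the (countably infinite) support of the measure:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 2.0
sticks, atoms, weights = [], [], []    # lazily instantiated representation of G

def sample_from_dp():
    """Draw one observation from G ~ DP(alpha, N(0, 1)), extending sticks on demand."""
    u, acc, k = rng.random(), 0.0, 0
    while True:
        if k == len(sticks):                        # lazily break a new stick
            sticks.append(rng.beta(1.0, alpha))
            atoms.append(rng.normal())              # atom ~ base measure H
            rem = np.prod([1.0 - s for s in sticks[:-1]])
            weights.append(sticks[-1] * rem)
        acc += weights[k]
        if u < acc:
            return atoms[k]
        k += 1

draws = [sample_from_dp() for _ in range(20)]
print(len(atoms), "atoms instantiated for 20 draws")  # only as many as were needed
```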

#### Bayesian two-sample tests

Karsten M. Borgwardt, Zoubin Ghahramani, 2009. (arXiv).

In this paper, we present two classes of Bayesian approaches to the two-sample problem. Our first class of methods extends the Bayesian t-test to include all parametric models in the exponential family and their conjugate priors. Our second class of methods uses Dirichlet process mixtures (DPM) of such conjugate-exponential distributions as flexible nonparametric priors over the unknown distributions.
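
For the simplest member of the first class, a Bernoulli likelihood with a Beta(1, 1) prior, the Bayes factor between "one shared rate" and "two separate rates" has a closed form in the conjugate marginal likelihoods; the sketch below (ours, for illustration) computes it:

```python
from math import lgamma, exp

def log_beta_marginal(k, n, a=1.0, b=1.0):
    """Log marginal likelihood of k successes in n Bernoulli trials under Beta(a, b)."""
    logB = lambda x, y: lgamma(x) + lgamma(y) - lgamma(x + y)
    return logB(a + k, b + n - k) - logB(a, b)

def bayes_factor_two_sample(k1, n1, k2, n2):
    """Bayes factor for H1 (two separate rates) over H0 (one shared rate)."""
    log_h0 = log_beta_marginal(k1 + k2, n1 + n2)
    log_h1 = log_beta_marginal(k1, n1) + log_beta_marginal(k2, n2)
    return exp(log_h1 - log_h0)

print(bayes_factor_two_sample(k1=30, n1=100, k2=55, n2=100))  # >> 1: rates differ
print(bayes_factor_two_sample(k1=48, n1=100, k2=52, n2=100))  # < 1: plausibly equal
```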

#### Scalable Gaussian Process Structured Prediction for Grid Factor Graph Applications

Sébastien Bratières, Novi Quadrianto, Sebastian Nowozin, Zoubin Ghahramani, 2014. (In 31st International Conference on Machine Learning).

Structured prediction is an important and well studied problem with many applications across machine learning. GPstruct is a recently proposed structured prediction model that offers appealing properties such as being kernelised, non-parametric, and supporting Bayesian inference (Bratières et al. 2013). The model places a Gaussian process prior over energy functions which describe relationships between input variables and structured output variables. However, the memory demand of GPstruct is quadratic in the number of latent variables and training runtime scales cubically. This prevents GPstruct from being applied to problems involving grid factor graphs, which are prevalent in computer vision and spatial statistics applications. Here we explore a scalable approach to learning GPstruct models based on ensemble learning, with weak learners (predictors) trained on subsets of the latent variables and bootstrap data, which can easily be distributed. We show experiments with 4M latent variables on image segmentation. Our method outperforms widely-used conditional random field models trained with pseudo-likelihood. Moreover, in image segmentation problems it improves over recent state-of-the-art marginal optimisation methods in terms of predictive performance and uncertainty calibration. Finally, it generalises well on all training set sizes.

#### Scaling the iHMM: Parallelization versus Hadoop

Sébastien Bratières, Jurgen Van Gael, Andreas Vlachos, Zoubin Ghahramani, 2010. (In Proceedings of the 2010 10th IEEE International Conference on Computer and Information Technology). Bradford, UK. IEEE Computer Society. **DOI**: 10.1109/CIT.2010.223. **ISBN**: 978-0-7695-4108-2.

This paper compares parallel and distributed implementations of an iterative, Gibbs-sampling, machine learning algorithm. Distributed implementations run under Hadoop on facility computing clouds. The probabilistic model under study is the infinite HMM (Beal, Ghahramani and Rasmussen, 2002), in which parameters are learnt using blocked Gibbs sampling, with a step consisting of a dynamic program. We apply this model to learn part-of-speech tags from newswire text in an unsupervised fashion. However, our focus here is on runtime performance rather than NLP-relevant scores: iteration duration, ease of development, deployment and debugging.

#### Bayesian Lipschitz Constant Estimation and Quadrature

Jan-Peter Calliess, December 2015. (In Workshop on Probabilistic Integration, NIPS). Montreal, Canada.

Lipschitz quadrature methods provide an approach to one-dimensional numerical integration on bounded domains. On the basis of the assumption that the integrand is Lipschitz continuous with a known Lipschitz constant, these quadrature rules can provide a tight error bound around their integral estimates and utilise the Lipschitz constant to guide exploration in the context of adaptive quadrature. In this paper, we outline our ongoing work on extending this approach to settings where the Lipschitz constant is probabilistically uncertain. As the key component, we introduce a Bayesian approach for updating a subjectively probabilistic belief of the Lipschitz constant. Combined with any Lipschitz quadrature rule, we obtain an approach for translating a sample into an integral estimate with probabilistic uncertainty intervals. The paper concludes with an illustration of the approach followed by a discussion of open issues and future work.

#### Lazily Adapted Constant Kinky Inference for Nonparametric Regression and Model-Reference Adaptive Control

Jan-Peter Calliess, 2016. (arXiv).

Techniques known as Nonlinear Set Membership prediction, Lipschitz Interpolation or Kinky Inference are approaches to machine learning that utilise presupposed Lipschitz properties to compute inferences over unobserved function values. Provided a bound on the true best Lipschitz constant of the target function is known a priori they offer convergence guarantees as well as bounds around the predictions. Considering a more general setting that builds on Hölder continuity relative to pseudo-metrics, we propose an online method for estimating the Hölder constant from function value observations that are possibly corrupted by bounded observational errors. Utilising this to compute adaptive parameters within a kinky inference rule gives rise to a nonparametric machine learning method, for which we establish strong universal approximation guarantees. That is, we show that our prediction rule can learn any continuous function in the limit of increasingly dense data to within a worst-case error bound that depends on the level of observational uncertainty. We apply our method in the context of nonparametric model-reference adaptive control (MRAC). Across a range of simulated aircraft roll-dynamics and performance metrics our approach outperforms recently proposed alternatives that were based on Gaussian processes and RBF-neural networks. For discrete-time systems, we provide stability guarantees for our learning-based controllers both for the batch and the online learning setting.
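
The core prediction rule can be stated in a few lines. The sketch below (a 1-D illustration under the Euclidean metric, not the paper's LACKI code) predicts the midpoint of the tightest ceiling and floor consistent with a Lipschitz bound, using a crude noiseless estimate of the constant:

```python
import numpy as np

def lipschitz_estimate(X, y):
    """Crude empirical Lipschitz constant estimate from noiseless data
    (the paper's method adjusts this for bounded observational noise)."""
    D = np.abs(X[:, None] - X[None, :])
    dy = np.abs(y[:, None] - y[None, :])
    mask = D > 0
    return (dy[mask] / D[mask]).max()

def kinky_inference(x_star, X, y, L):
    """Lipschitz interpolation: midpoint of the tightest ceiling and floor
    compatible with |f(x) - f(x')| <= L |x - x'|."""
    d = np.abs(X - x_star)
    ceiling = np.min(y + L * d)
    floor = np.max(y - L * d)
    return 0.5 * (ceiling + floor), 0.5 * (ceiling - floor)  # prediction, error bound

X = np.array([0.0, 0.5, 1.0, 1.5])
y = np.sin(2 * X)
L = lipschitz_estimate(X, y)
pred, err = kinky_inference(0.75, X, y, L)
print(f"f(0.75) ~ {pred:.3f} +/- {err:.3f}  (true {np.sin(1.5):.3f})")
```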

#### Lipschitz Optimisation for Lipschitz Interpolation

Jan-Peter Calliess, May 2017. (In 2017 American Control Conference (ACC 2017)). Seattle, WA, USA.

Techniques known as Nonlinear Set Membership prediction, Kinky Inference or Lipschitz Interpolation are fast and numerically robust approaches to nonparametric machine learning that have been proposed to be utilised in the context of system identification and learning-based control. They utilise presupposed Lipschitz properties in order to compute inferences over unobserved function values. Unfortunately, most of these approaches rely on exact knowledge about the input space metric as well as about the Lipschitz constant. Furthermore, existing techniques to estimate the Lipschitz constants from the data are not robust to noise or seem to be ad-hoc and typically are decoupled from the ultimate learning and prediction task. To overcome these limitations, we propose an approach for optimising parameters of the presupposed metrics by minimising validation set prediction errors. To avoid poor performance due to local minima, we propose to utilise Lipschitz properties of the optimisation objective to ensure global optimisation success. The resulting approach is a new flexible method for nonparametric black-box learning. We illustrate its competitiveness on a set of benchmark problems.

#### Lazily Adapted Constant Kinky Inference for non-parametric regression and model-reference adaptive control

Jan-Peter Calliess, Stephen J. Roberts, Carl Edward Rasmussen, Jan Maciejowski, 2020. (Automatica). **DOI**: 10.1016/j.automatica.2020.109216.

Techniques known as Nonlinear Set Membership prediction or Lipschitz Interpolation are approaches to supervised machine learning that utilise presupposed Lipschitz properties to perform inference over unobserved function values. Provided a bound on the true best Lipschitz constant of the target function is known a priori, they offer convergence guarantees, as well as bounds around the predictions. Considering a more general setting that builds on Lipschitz continuity, we propose an online method for estimating the Lipschitz constant online from function value observations that are possibly corrupted by bounded noise. Utilising this as a data-dependent hyper-parameter gives rise to a nonparametric machine learning method, for which we establish strong universal approximation guarantees. That is, we show that our prediction rule can learn any continuous function on compact support in the limit of increasingly dense data, up to a worst-case error that can be bounded by the level of observational error. We also consider applications of our nonparametric regression method to learning-based control. For a class of discrete-time settings, we establish convergence guarantees on the closed-loop tracking error of our online learning-based controllers. To provide evidence that our method can be beneficial not only in theory but also in practice, we apply it in the context of nonparametric model-reference adaptive control (MRAC). Across a range of simulated aircraft roll-dynamics and performance metrics our approach outperforms recently proposed alternatives that were based on Gaussian processes and RBF-neural networks.

#### Understanding Local Linearisation in Variational Gaussian Process State Space Models

Talay M Cheema, 2021. (In Time Series Workshop at the 38th International Conference on Machine Learning).

We describe variational inference approaches in Gaussian process state space models in terms of local linearisations of the approximate posterior function. Most previous approaches have either assumed independence between the posterior dynamics and latent states (the mean-field (MF) approximation), or optimised free parameters for both, leading to limited scalability. We use our framework to prove that (i) there is a theoretical imperative to use non-MF approaches, to avoid excessive bias in the process noise hyperparameter estimate, and (ii) we can parameterise only the posterior dynamics without any loss of performance. Our approach suggests further approximations, based on the existing rich literature on filtering and smoothing for nonlinear systems, and unifies approaches for discrete and continuous time models.

#### Contrasting Discrete and Continuous Methods for Bayesian System Identification

Talay M Cheema, 2022. (In Workshop on Continuous Time Machine Learning at the 39th International Conference on Machine Learning).

In recent years, there has been considerable interest in embedding continuous time methods in machine learning algorithms. In system identification, the task is to learn a dynamical model from incomplete observation data, and when prior knowledge is in continuous time – for example, mechanistic differential equation models – it seems natural to use continuous time models for learning. Yet when learning flexible, nonlinear, probabilistic dynamics models, most previous work has focused on discrete time models to avoid computational, numerical, and mathematical difficulties. In this work we show, with the aid of small-scale examples, that this mismatch between model and data generating process can be consequential under certain circumstances, and we discuss possible modifications to discrete time models which may better suit them to handling data generated by continuous time processes.

#### The Indian Buffet Process: Scalable Inference and Extensions

Finale Doshi-Velez, August 2009. University of Cambridge, Cambridge, UK.

Many unsupervised learning problems seek to identify hidden features from observations. In many real-world situations, the number of hidden features is unknown. To avoid specifying the number of hidden features a priori, one can use the Indian Buffet Process (IBP): a nonparametric latent feature model that does not bound the number of active features in a dataset. While elegant, the lack of efficient inference procedures for the IBP has prevented its application in large-scale problems. The core contributions of this thesis are three new inference procedures that allow inference in the IBP to be scaled from a few hundred to 100,000 observations. This thesis contains three parts: (1) An introduction to the IBP and a review of inference techniques and extensions. The first chapters summarise three constructions for the IBP and review all currently published inference techniques. Appendix C reviews extensions of the IBP to date. (2) Novel techniques for scalable Bayesian inference. This thesis presents three new inference procedures: (a) an accelerated Gibbs sampler for efficient Bayesian inference in a broad class of conjugate models, (b) a parallel, asynchronous Gibbs sampler that allows the accelerated Gibbs sampler to be distributed across multiple processors, and (c) a variational inference procedure for the IBP. (3) A framework for structured nonparametric latent feature models. We also present extensions to the IBP to model more sophisticated relationships between the co-occurring hidden features, providing a general framework for correlated non-parametric feature models.

#### Accelerated Gibbs sampling for the Indian buffet process

Finale Doshi-Velez, Zoubin Ghahramani, June 2009. (In 26th International Conference on Machine Learning). Edited by Léon Bottou, Michael Littman. Montréal, QC, Canada. Omnipress.

We often seek to identify co-occurring hidden features in a set of observations. The Indian Buffet Process (IBP) provides a non-parametric prior on the features present in each observation, but current inference techniques for the IBP often scale poorly. The collapsed Gibbs sampler for the IBP has a running time cubic in the number of observations, and the uncollapsed Gibbs sampler, while linear, is often slow to mix. We present a new linear-time collapsed Gibbs sampler for conjugate likelihood models and demonstrate its efficacy on large real-world datasets.

#### Accelerated sampling for the Indian Buffet Process

Finale Doshi-Velez, Zoubin Ghahramani, 2009. (In ICML). Edited by Andrea Pohoreckyj Danyluk, Léon Bottou, Michael L. Littman. ACM. ACM International Conference Proceeding Series. **ISBN**: 978-1-60558-516-1.

We often seek to identify co-occurring hidden features in a set of observations. The Indian Buffet Process (IBP) provides a nonparametric prior on the features present in each observation, but current inference techniques for the IBP often scale poorly. The collapsed Gibbs sampler for the IBP has a running time cubic in the number of observations, and the uncollapsed Gibbs sampler, while linear, is often slow to mix. We present a new linear-time collapsed Gibbs sampler for conjugate likelihood models and demonstrate its efficacy on large real-world datasets.

#### Large Scale Non-parametric Inference: Data Parallelisation in the Indian Buffet Process

Finale Doshi-Velez, David Knowles, Shakir Mohamed, Zoubin Ghahramani, December 2009. (In Advances in Neural Information Processing Systems 22). Cambridge, MA, USA. The MIT Press.

Nonparametric Bayesian models provide a framework for flexible probabilistic modelling of complex datasets. Unfortunately, the high-dimensional averages required for Bayesian methods can be slow, especially with the unbounded representations used by nonparametric models. We address the challenge of scaling Bayesian inference to the increasingly large datasets found in real-world applications. We focus on parallelisation of inference in the Indian Buffet Process (IBP), which allows data points to have an unbounded number of sparse latent features. Our novel MCMC sampler divides a large data set between multiple processors and uses message passing to compute the global likelihoods and posteriors. This algorithm, the first parallel inference scheme for IBP-based models, scales to datasets orders of magnitude larger than have previously been possible.

#### Variational inference for the Indian buffet process

F. Doshi-Velez, K.T. Miller, J. Van Gael, Y.W. Teh, April 2009. (In 12th International Conference on Artificial Intelligence and Statistics). Clearwater Beach, FL, USA. Journal of Machine Learning Research.

The Indian Buffet Process (IBP) is a nonparametric prior for latent feature models in which observations are influenced by a combination of hidden features. For example, images may be composed of several objects and sounds may consist of several notes. Latent feature models seek to infer these unobserved features from a set of observations; the IBP provides a principled prior in situations where the number of hidden features is unknown. Current inference methods for the IBP have all relied on sampling. While these methods are guaranteed to be accurate in the limit, samplers for the IBP tend to mix slowly in practice. We develop a deterministic variational method for inference in the IBP based on a truncated stick-breaking approximation, provide theoretical bounds on the truncation error, and evaluate our method in several data regimes.
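
The truncated stick-breaking representation being approximated here (due to Teh, Görür and Ghahramani, 2007) is easy to sample from directly. A small sketch (ours), with `K` as the truncation level:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, K, N = 3.0, 20, 10      # concentration, truncation level, observations

# Stick-breaking construction of IBP feature probabilities:
# pi_k = prod_{j <= k} v_j with v_j ~ Beta(alpha, 1), so pi_1 >= pi_2 >= ... -> 0.
v = rng.beta(alpha, 1.0, size=K)
pi = np.cumprod(v)

# A truncated variational posterior places distributions over v and Z;
# here we simply sample a binary feature matrix from the truncated prior.
Z = rng.random((N, K)) < pi
print(Z.astype(int))
print("mean features per object (~ alpha under the full IBP):", Z.sum(1).mean())
```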

#### Variational Inference for the Indian Buffet Process

Finale Doshi-Velez, Kurt T. Miller, Jurgen Van Gael, Yee Whye Teh, April 2009. University of Cambridge, Computational and Biological Learning Laboratory, Department of Engineering.

The Indian Buffet Process (IBP) is a nonparametric prior for latent feature models in which observations are influenced by a combination of hidden features. For example, images may be composed of several objects and sounds may consist of several notes. Latent feature models seek to infer these unobserved features from a set of observations; the IBP provides a principled prior in situations where the number of hidden features is unknown. Current inference methods for the IBP have all relied on sampling. While these methods are guaranteed to be accurate in the limit, samplers for the IBP tend to mix slowly in practice. We develop a deterministic variational method for inference in the IBP based on truncating the infinite model, provide theoretical bounds on the truncation error, and evaluate our method in several data regimes. This technical report is a longer version of Doshi-Velez et al. (2009).

#### Clustering Protein Sequence and Structure Space with Infinite Gaussian Mixture Models

A. Dubey, S. Hwang, C. Rangel, Carl Edward Rasmussen, Zoubin Ghahramani, David L. Wild, 2004. (In Pacific Symposium on Biocomputing 2004; Vol. 9). The Big Island of Hawaii. Singapore. World Scientific Publishing.

We describe a novel approach to the problem of automatically clustering protein sequences and discovering protein families, subfamilies etc., based on the theory of infinite Gaussian mixture models. This method allows the data itself to dictate how many mixture components are required to model it, and provides a measure of the probability that two proteins belong to the same cluster. We illustrate our methods with application to three data sets: globin sequences, globin sequences with known three-dimensional structures and G-protein coupled receptor sequences. The consistency of the clusters indicates that our method is producing biologically meaningful results, which provide a very good indication of the underlying families and subfamilies. With the inclusion of secondary structure and residue solvent accessibility information, we obtain a classification of sequences of known structure which reflects and extends their SCOP classifications.
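
An approximation of this kind of model is available off the shelf: scikit-learn's `BayesianGaussianMixture` with a Dirichlet-process weight prior performs (variational, rather than the MCMC used in this line of work) inference in a truncated infinite Gaussian mixture. A small usage sketch:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Three well-separated 2-D Gaussian clusters; the model is not told there are three.
X = np.vstack([rng.normal(m, 0.3, size=(100, 2)) for m in ([0, 0], [3, 0], [0, 3])])

dpgmm = BayesianGaussianMixture(
    n_components=15,                                   # truncation level, not cluster count
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=1.0,                    # DP concentration parameter
    max_iter=500,
    random_state=0,
).fit(X)

effective = (dpgmm.weights_ > 0.01).sum()              # components with non-negligible mass
print("effective clusters:", effective)                # typically 3 here
```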

#### Structure Discovery in Nonparametric Regression through Compositional Kernel Search

David Duvenaud, James Robert Lloyd, Roger Grosse, Joshua B. Tenenbaum, Zoubin Ghahramani, June 2013. (In 30th International Conference on Machine Learning). Atlanta, Georgia, USA.

Despite its importance, choosing the structural form of the kernel in nonparametric regression remains a black art. We define a space of kernel structures which are built compositionally by adding and multiplying a small number of base kernels. We present a method for searching over this space of structures which mirrors the scientific discovery process. The learned structures can often decompose functions into interpretable components and enable long-range extrapolation on time-series datasets. Our structure search method outperforms many widely used kernels and kernel combination methods on a variety of prediction tasks.
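
A toy version of the search loop can be written with scikit-learn's composable GP kernels. This sketch (ours) greedily extends the current kernel by adding or multiplying base kernels, scoring candidates by optimised log marginal likelihood (the paper additionally penalises model complexity):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, DotProduct, ExpSineSquared

rng = np.random.default_rng(0)
X = np.linspace(0, 6, 80)[:, None]
y = 0.5 * X.ravel() + np.sin(3 * X.ravel()) + 0.1 * rng.normal(size=80)  # trend + periodicity

def score(kernel):
    # Fit hyperparameters and return (log marginal likelihood, fitted kernel).
    gp = GaussianProcessRegressor(kernel=kernel, alpha=0.01, normalize_y=True).fit(X, y)
    return gp.log_marginal_likelihood_value_, gp.kernel_

base = [RBF(), DotProduct(), ExpSineSquared()]
best_score, best_kernel = max((score(k) for k in base), key=lambda t: t[0])
for _ in range(2):
    candidates = [best_kernel + k for k in base] + [best_kernel * k for k in base]
    s, k = max((score(c) for c in candidates), key=lambda t: t[0])
    if s <= best_score:
        break
    best_score, best_kernel = s, k
print(best_kernel)   # e.g. a sum of a linear and a periodic component
```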

#### Avoiding pathologies in very deep networks

David Duvenaud, Oren Rippel, Ryan P. Adams, Zoubin Ghahramani, April 2014. (In 17th International Conference on Artificial Intelligence and Statistics). Reykjavik, Iceland.

Choosing appropriate architectures and regularization strategies for deep networks is crucial to good predictive performance. To shed light on this problem, we analyze the analogous problem of constructing useful priors on compositions of functions. Specifically, we study the deep Gaussian process, a type of infinitely-wide, deep neural network. We show that in standard architectures, the representational capacity of the network tends to capture fewer degrees of freedom as the number of layers increases, retaining only a single degree of freedom in the limit. We propose an alternate network architecture which does not suffer from this pathology. We also examine deep covariance functions, obtained by composing infinitely many feature transforms. Lastly, we characterize the class of models obtained by performing dropout on Gaussian processes.
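
The pathology is easy to reproduce numerically: repeatedly feed a GP sample's outputs back in as the next layer's inputs. A minimal sketch (ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gp(x, lengthscale=1.0, jitter=1e-6):
    """Draw one function value vector f(x) from a zero-mean GP with an RBF kernel."""
    d2 = (x[:, None] - x[None, :]) ** 2
    K = np.exp(-0.5 * d2 / lengthscale**2) + jitter * np.eye(len(x))
    return np.linalg.cholesky(K) @ rng.normal(size=len(x))

h = np.linspace(-3, 3, 200)
for layer in range(1, 6):          # compose f_5(f_4(...f_1(x)...))
    h = sample_gp(h)
    # With depth, composed samples tend towards nearly piecewise-constant
    # functions: the single-degree-of-freedom pathology described above.
    print(f"layer {layer}: std of increments = {np.diff(h).std():.4f}")
```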

#### Stick-breaking representations of sigma-Stable Poisson-Kingman models

Stefano Favaro, Maria Lomeli, Bernardo Nipoti, Yee Whye Teh, 2014. (Electronic Journal of Statistics).

In this paper we investigate the stick-breaking representation for the class of sigma-Stable Poisson-Kingman models, also known as Gibbs-type random probability measures. This class includes as special cases most of the discrete priors commonly used in Bayesian nonparametrics, such as the two parameter Poisson-Dirichlet process and the normalized generalized Gamma process. Under the assumption sigma=u/v, for any coprime integers 1 <= u < v such that u/v < 1/2, we show that a sigma-stable Poisson-Kingman model admits an explicit stick-breaking representation in terms of random variables which are obtained by suitably transforming Gamma random variables and products of independent Beta and Gamma random variables.

#### On a class of sigma-Stable Poisson-Kingman models and an effective marginalised sampler

Stefano Favaro, Maria Lomeli, Yee Whye Teh, 2015. (Statistics and Computing).

We investigate the use of a large class of discrete random probability measures, which is referred to as the class Q, in the context of Bayesian nonparametric mixture modeling. The class Q encompasses both the two-parameter Poisson–Dirichlet process and the normalized generalized Gamma process, thus allowing us to comparatively study the inferential advantages of these two well-known nonparametric priors. Apart from a highly flexible parameterization, the distinguishing feature of the class Q is the availability of a tractable posterior distribution. This feature, in turn, allows us to derive an efficient marginal MCMC algorithm for posterior sampling within the framework of mixture models. We demonstrate the efficacy of our modeling framework on both one-dimensional and multi-dimensional datasets.

#### Variational Gaussian Process State-Space Models

Roger Frigola, Yutian Chen, Carl Edward Rasmussen, 2014. (In Advances in Neural Information Processing Systems 27). Edited by Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, K.Q. Weinberger.

State-space models have been successfully used for more than fifty years in different areas of science and engineering. We present a procedure for efficient variational Bayesian learning of nonlinear state-space models based on sparse Gaussian processes. The result of learning is a tractable posterior over nonlinear dynamical systems. In comparison to conventional parametric models, we offer the possibility to straightforwardly trade off model capacity and computational cost whilst avoiding overfitting. Our main algorithm uses a hybrid inference approach combining variational Bayes and sequential Monte Carlo. We also present stochastic variational inference and online learning approaches for fast learning with long time series.

#### Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC

Roger Frigola, Fredrik Lindsten, Thomas B. Schön, Carl Edward Rasmussen, 2013. (In Advances in Neural Information Processing Systems 26). Edited by L. Bottou, C.J.C. Burges, Z. Ghahramani, M. Welling, K.Q. Weinberger. Curran Associates, Inc..

State-space models are successfully used in many areas of science, engineering and economics to model time series and dynamical systems. We present a fully Bayesian approach to inference and learning in nonlinear nonparametric state-space models. We place a Gaussian process prior over the transition dynamics, resulting in a flexible model able to capture complex dynamical phenomena. However, to enable efficient inference, we marginalize over the dynamics of the model and instead infer directly the joint smoothing distribution through the use of specially tailored Particle Markov Chain Monte Carlo samplers. Once a sample from the smoothing distribution is computed, the state transition predictive distribution can be formulated analytically. We make use of sparse Gaussian process models to greatly reduce the computational complexity of the approach.
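
The generative model at the heart of these GP-SSM papers can be sketched in a few lines: sample the transition function values sequentially from the GP prior, conditioning each new value on those already drawn along the trajectory. A 1-D illustration (ours, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf(a, b, ls=1.0):
    return np.exp(-0.5 * np.subtract.outer(a, b) ** 2 / ls**2)

# Trajectory x_{t+1} = f(x_t) + process noise, with f ~ GP(0, k).
T, q_std, r_std = 50, 0.05, 0.1
x = 0.5
x_seen, f_seen = [], []                 # inputs where f has already been sampled
traj = [x]
for t in range(T):
    if x_seen:
        K = rbf(np.array(x_seen), np.array(x_seen)) + 1e-8 * np.eye(len(x_seen))
        k_star = rbf(np.array([x]), np.array(x_seen))[0]
        mu = k_star @ np.linalg.solve(K, np.array(f_seen))
        var = max(1.0 - k_star @ np.linalg.solve(K, k_star), 0.0)
    else:
        mu, var = 0.0, 1.0
    f = mu + np.sqrt(var) * rng.normal()    # f(x) | previously sampled values
    x_seen.append(x)
    f_seen.append(f)
    x = f + q_std * rng.normal()
    traj.append(x)

y = np.array(traj) + r_std * rng.normal(size=T + 1)   # noisy observations of the states
print(np.round(y[:10], 3))
```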

#### Identification of Gaussian Process State-Space Models with Particle Stochastic Approximation EM

Roger Frigola, Fredrik Lindsten, Thomas B. Schön, Carl Edward Rasmussen, 2014. (In Proceedings of the 19th World Congress of the International Federation of Automatic Control (IFAC)).

Gaussian process state-space models (GP-SSMs) are a very flexible family of models of nonlinear dynamical systems. They comprise a Bayesian nonparametric representation of the dynamics of the system and additional (hyper-)parameters governing the properties of this nonparametric representation. The Bayesian formalism enables systematic reasoning about the uncertainty in the system dynamics. We present an approach to maximum likelihood identification of the parameters in GP-SSMs, while retaining the full nonparametric description of the dynamics. The method is based on a stochastic approximation version of the EM algorithm that employs recent developments in particle Markov chain Monte Carlo for efficient identification.

#### Integrated Pre-Processing for Bayesian Nonlinear System Identification with Gaussian Processes

Roger Frigola, Carl Edward Rasmussen, 2013. (In 52nd IEEE Conference on Decision and Control (CDC 2013)).

We introduce GP-FNARX: a new model for nonlinear system identification based on a nonlinear autoregressive exogenous model (NARX) with filtered regressors (F) where the nonlinear regression problem is tackled using sparse Gaussian processes (GP). We integrate data pre-processing with system identification into a fully automated procedure that goes from raw data to an identified model. Both pre-processing parameters and GP hyper-parameters are tuned by maximizing the marginal likelihood of the probabilistic model. We obtain a Bayesian model of the system’s dynamics which is able to report its uncertainty in regions where the data is scarce. The automated approach, the modeling of uncertainty and its relatively low computational cost make GP-FNARX a good candidate for applications in robotics and adaptive control.

#### Pitfalls in the use of Parallel Inference for the Dirichlet Process

Yarin Gal, Zoubin Ghahramani, 2014. (In Proceedings of the 31st International Conference on Machine Learning (ICML-14)).

Recent work done by Lovell, Adams, and Mansinghka (2012) and Williamson, Dubey, and Xing (2013) has suggested an alternative parametrisation for the Dirichlet process in order to derive non-approximate parallel MCMC inference for it – work which has been picked up and implemented in several different fields. In this paper we show that the approach suggested is impractical due to an extremely unbalanced distribution of the data. We characterise the requirements of efficient parallel inference for the Dirichlet process and show that the proposed inference fails most of these requirements (while approximate approaches often satisfy most of them). We present both theoretical and experimental evidence, analysing the load balance for the inference and showing that it is independent of the size of the dataset and the number of nodes available in the parallel implementation. We end with suggestions of alternative paths of research for efficient non-approximate parallel inference for the Dirichlet process.

#### A Systematic Bayesian Treatment of the IBM Alignment Models

Yarin Gal, Phil Blunsom, 2013. (In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies). Association for Computational Linguistics.

The dominant yet ageing IBM and HMM word alignment models underpin most popular Statistical Machine Translation implementations in use today. Though beset by the limitations of implausible independence assumptions, intractable optimisation problems, and an excess of tunable parameters, these models provide a scalable and reliable starting point for inducing translation systems. In this paper we build upon this venerable base by recasting these models in the non-parametric Bayesian framework. By replacing the categorical distributions at their core with hierarchical Pitman-Yor processes, and through the use of collapsed Gibbs sampling, we provide a more flexible formulation and sidestep the original heuristic optimisation techniques. The resulting models are highly extendible, naturally permitting the introduction of phrasal dependencies. We present extensive experimental results showing improvements in both AER and BLEU when benchmarked against Giza++, including significant improvements over IBM model 4.

#### Latent Gaussian Processes for Distribution Estimation of Multivariate Categorical Data

Yarin Gal, Yutian Chen, Zoubin Ghahramani, 2015. (In Proceedings of the 32nd International Conference on Machine Learning (ICML-15)).

Multivariate categorical data occur in many applications of machine learning. One of the main difficulties with these vectors of categorical variables is sparsity. The number of possible observations grows exponentially with vector length, but dataset diversity might be poor in comparison. Recent models have gained significant improvement in supervised tasks with this data. These models embed observations in a continuous space to capture similarities between them. Building on these ideas we propose a Bayesian model for the unsupervised task of distribution estimation of multivariate categorical data. We model vectors of categorical variables as generated from a non-linear transformation of a continuous latent space. Non-linearity captures multi-modality in the distribution. The continuous representation addresses sparsity. Our model ties together many existing models, linking the linear categorical latent Gaussian model, the Gaussian process latent variable model, and Gaussian process classification. We derive inference for our model based on recent developments in sampling based variational inference. We show empirically that the model outperforms its linear and discrete counterparts in imputation tasks of sparse data.

#### Improving the Gaussian Process Sparse Spectrum Approximation by Representing Uncertainty in Frequency Inputs

Yarin Gal, Richard Turner, 2015. (In Proceedings of the 32nd International Conference on Machine Learning (ICML-15)).

Standard sparse pseudo-input approximations to the Gaussian process (GP) cannot handle complex functions well. Sparse spectrum alternatives attempt to answer this but are known to over-fit. We suggest the use of variational inference for the sparse spectrum approximation to avoid both issues. We model the covariance function with a finite Fourier series approximation and treat it as a random variable. The random covariance function has a posterior, on which a variational distribution is placed. The variational distribution transforms the random covariance function to fit the data. We study the properties of our approximate inference, compare it to alternative ones, and extend it to the distributed and stochastic domains. Our approximation captures complex functions better than standard approaches and avoids over-fitting.
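
For reference, the underlying (non-variational) sparse spectrum construction of Lázaro-Gredilla et al. (2010) reduces GP regression to Bayesian linear regression in trigonometric features with random frequencies. A minimal sketch (ours):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sparse spectrum approximation of an RBF kernel:
# k(x, x') ~ phi(x) . phi(x'), with frequencies drawn from the kernel's spectral
# density. The variational method above treats these frequencies as random
# variables rather than optimising them directly.
m, ls, sigma_f, sigma_n = 50, 0.5, 1.0, 0.1
omega = rng.normal(scale=1.0 / ls, size=m)          # spectral sample for the RBF kernel

def phi(x):
    return np.concatenate([np.cos(np.outer(x, omega)),
                           np.sin(np.outer(x, omega))], axis=1) * sigma_f / np.sqrt(m)

X = rng.uniform(-3, 3, 40)
y = np.sinc(X) + sigma_n * rng.normal(size=40)

# Exact Bayesian linear regression in the 2m trigonometric features.
Phi = phi(X)
A = Phi.T @ Phi + sigma_n**2 * np.eye(2 * m)
w_mean = np.linalg.solve(A, Phi.T @ y)

x_test = np.linspace(-3, 3, 5)
print(phi(x_test) @ w_mean)                          # approximate GP posterior mean
```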

#### Distributed Variational Inference in Sparse Gaussian Process Regression and Latent Variable Models

Yarin Gal, Mark van der Wilk, Carl Rasmussen, 2014. (In Advances in Neural Information Processing Systems 27). Edited by Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, K.Q. Weinberger. Curran Associates, Inc..

Gaussian processes (GPs) are a powerful tool for probabilistic inference over functions. They have been applied to both regression and non-linear dimensionality reduction, and offer desirable properties such as uncertainty estimates, robustness to over-fitting, and principled ways for tuning hyper-parameters. However the scalability of these models to big datasets remains an active topic of research. We introduce a novel re-parametrisation of variational inference for sparse GP regression and latent variable models that allows for an efficient distributed algorithm. This is done by exploiting the decoupling of the data given the inducing points to re-formulate the evidence lower bound in a Map-Reduce setting. We show that the inference scales well with data and computational resources, while preserving a balanced distribution of the load among the nodes. We further demonstrate the utility in scaling Gaussian processes to big data. We show that GP performance improves with increasing amounts of data in regression (on flight data with 2 million records) and latent variable modelling (on MNIST). The results show that GPs perform better than many common models often used for big data.

#### Bayesian nonparametrics and the probabilistic approach to modelling

Zoubin Ghahramani, 2013. (Philosophical Transactions of the Royal Society A).

Modelling is fundamental to many fields of science and engineering. A model can be thought of as a representation of possible data one could predict from a system. The probabilistic approach to modelling uses probability theory to express all aspects of uncertainty in the model. The probabilistic approach is synonymous with Bayesian modelling, which simply uses the rules of probability theory in order to make predictions, compare alternative models, and learn model parameters and structure from data. This simple and elegant framework is most powerful when coupled with flexible probabilistic models. Flexibility is achieved through the use of Bayesian nonparametrics. This article provides an overview of probabilistic modelling and an accessible survey of some of the main tools in Bayesian nonparametrics. The survey covers the use of Bayesian nonparametrics for modelling unknown functions, density estimation, clustering, time series modelling, and representing sparsity, hierarchies, and covariance structure. More specifically it gives brief non-technical overviews of Gaussian processes, Dirichlet processes, infinite hidden Markov models, Indian buffet processes, Kingman’s coalescent, Dirichlet diffusion trees, and Wishart processes.

#### Probabilistic machine learning and artificial intelligence

Zoubin Ghahramani, 2015. (Nature). **DOI**: 10.1038/nature14541.

How can a machine learn from experience? Probabilistic modelling provides a framework for understanding what learning is, and has therefore emerged as one of the principal theoretical and practical approaches for designing machines that learn from data acquired through experience. The probabilistic framework, which describes how to represent and manipulate uncertainty about models and predictions, has a central role in scientific data analysis, machine learning, robotics, cognitive science and artificial intelligence. This Review provides an introduction to this framework, and discusses some of the state-of-the-art advances in the field, namely, probabilistic programming, Bayesian optimization, data compression and automatic model discovery.

#### Bayesian nonparametric latent feature models (with discussion)

Z. Ghahramani, T.L. Griffiths, P. Sollich, July 2007. (In Bayesian Statistics 8). Edited by J.M. Bernardo, M.J. Bayarri, J.O. Berger, A.P. Dawid, D. Heckerman, A.F.M. Smith, M. West. Oxford, UK. Oxford University Press.

We describe a flexible nonparametric approach to latent variable modelling in which the number of latent variables is unbounded. This approach is based on a probability distribution over equivalence classes of binary matrices with a finite number of rows, corresponding to the data points, and an unbounded number of columns, corresponding to the latent variables. Each data point can be associated with a subset of the possible latent variables, which we refer to as the latent features of that data point. The binary variables in the matrix indicate which latent feature is possessed by which data point, and there is a potentially infinite array of features. We derive the distribution over unbounded binary matrices by taking the limit of a distribution over N×K binary matrices as K→∞. We define a simple generative process for this distribution which we call the Indian buffet process (IBP; Griffiths and Ghahramani, 2005, 2006) by analogy to the Chinese restaurant process (Aldous, 1985; Pitman, 2002). The IBP has a single hyperparameter which controls both the number of features per object and the total number of features. We describe a two-parameter generalization of the IBP which has additional flexibility, independently controlling the number of features per object and the total number of features in the matrix. The use of this distribution as a prior in an infinite latent feature model is illustrated, and Markov chain Monte Carlo algorithms for inference are described.

**Comment:** Includes discussion by David Dunson, and rejoinder.

#### A Choice Model with Infinitely Many Latent Features

Dilan Görür, Frank Jäkel, Carl Edward Rasmussen, June 2006. (In 23rd International Conference on Machine Learning). Edited by W. W. Cohen, Andrew Moore. New York, NY, USA. Pittsburgh, PA, USA. ACM Press. **DOI**: 10.1145/1143844.1143890.

Elimination by aspects (EBA) is a probabilistic choice model describing how humans decide between several options. The options from which the choice is made are characterized by binary features and associated weights. For instance, when choosing which mobile phone to buy the features to consider may be: long lasting battery, color screen, etc. Existing methods for inferring the parameters of the model assume pre-specified features. However, the features that lead to the observed choices are not always known. Here, we present a non-parametric Bayesian model to infer the features of the options and the corresponding weights from choice data. We use the Indian buffet process (IBP) as a prior over the features. Inference using Markov chain Monte Carlo (MCMC) in conjugate IBP models has been previously described. The main contribution of this paper is an MCMC algorithm for the EBA model that can also be used in inference for other non-conjugate IBP models—this may broaden the use of IBP priors considerably.

#### Dirichlet Process Gaussian Mixture Models: Choice of the base distribution

Dilan Görür, Carl Edward Rasmussen, July 2010. (Journal of Computer Science and Technology). Beijing, China. Science Press. **DOI**: 10.1007/s11390-010-9355-8.

In the Bayesian mixture modeling framework it is possible to infer the necessary number of components to model the data and therefore it is unnecessary to explicitly restrict the number of components. Nonparametric mixture models sidestep the problem of finding the “correct” number of mixture components by assuming infinitely many components. In this paper Dirichlet process mixture (DPM) models are cast as infinite mixture models and inference using Markov chain Monte Carlo is described. The specification of the priors on the model parameters is often guided by mathematical and practical convenience. The primary goal of this paper is to compare the choice of conjugate and non-conjugate base distributions on a particular class of DPM models which is widely used in applications, the Dirichlet process Gaussian mixture model (DPGMM). We compare computational efficiency and modeling performance of DPGMM defined using a conjugate and a conditionally conjugate base distribution. We show that better density models can result from using a wider class of priors with no or only a modest increase in computational effort.

#### Infinite Latent Feature Models and the Indian Buffet Process

T. L. Griffiths, Z. Ghahramani, December 2006. (In Advances in Neural Information Processing Systems 18). Edited by Y. Weiss, B. Schölkopf, J. Platt. Cambridge, MA, USA. The MIT Press.

We define a probability distribution over equivalence classes of binary matrices with a finite number of rows and an unbounded number of columns. This distribution is suitable for use as a prior in probabilistic models that represent objects using a potentially infinite array of features. We identify a simple generative process that results in the same distribution over equivalence classes, which we call the Indian buffet process. We illustrate the use of this distribution as a prior in an infinite latent feature model, deriving a Markov chain Monte Carlo algorithm for inference in this model and applying the algorithm to an image dataset.
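
The generative process itself takes only a few lines: customer i samples each previously tasted dish k with probability m_k / i, then tries Poisson(α/i) new dishes. A small sketch (ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def indian_buffet(N, alpha):
    """Sample a binary feature matrix Z from the IBP with concentration alpha."""
    dish_counts = []                       # m_k: how many customers took dish k
    rows = []
    for i in range(1, N + 1):
        # Sample existing dishes proportionally to their popularity.
        row = [rng.random() < m / i for m in dish_counts]
        for k, taken in enumerate(row):
            dish_counts[k] += taken
        new = rng.poisson(alpha / i)       # try some brand-new dishes
        dish_counts.extend([1] * new)
        row.extend([True] * new)
        rows.append(row)
    Z = np.zeros((N, len(dish_counts)), dtype=int)
    for i, row in enumerate(rows):
        Z[i, :len(row)] = row
    return Z

print(indian_buffet(N=8, alpha=2.0))
```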

#### The Indian buffet process: An introduction and review

Thomas L. Griffiths, Zoubin Ghahramani, April 2011. (Journal of Machine Learning Research).

The Indian buffet process is a stochastic process defining a probability distribution over equivalence classes of sparse binary matrices with a finite number of rows and an unbounded number of columns. This distribution is suitable for use as a prior in probabilistic models that represent objects using a potentially infinite array of features, or that involve bipartite graphs in which the size of at least one class of nodes is unknown. We give a detailed derivation of this distribution, and illustrate its use as a prior in an infinite latent feature model. We then review recent applications of the Indian buffet process in machine learning, discuss its extensions, and summarize its connections to other stochastic processes.

#### Variational inference for nonparametric multiple clustering

Y. Guan, J. G. Dy, D. Niu, Z. Ghahramani, July 2010. (In KDD10 Workshop on Discovering, Summarizing, and Using Multiple Clusterings). Washington, DC, USA.

Most clustering algorithms produce a single clustering solution. Similarly, feature selection for clustering tries to find one feature subset where one interesting clustering solution resides. However, a single data set may be multi-faceted and can be grouped and interpreted in many different ways, especially for high dimensional data, where feature selection is typically needed. Moreover, different clustering solutions are interesting for different purposes. Instead of committing to one clustering solution, in this paper we introduce a probabilistic nonparametric Bayesian model that can discover several possible clustering solutions and the feature subset views that generated each cluster partitioning simultaneously. We provide a variational inference approach to learn the features and clustering partitions in each view. Our model allows us not only to learn the multiple clusterings and views but also allows us to automatically learn the number of views and the number of clusters in each view.

#### Beta diffusion trees

Creighton Heaukulani, David A. Knowles, Zoubin Ghahramani, June 2014. (In 31st International Conference on Machine Learning). Beijing, China.

We define the beta diffusion tree, a random tree structure with a set of leaves that defines a collection of overlapping subsets of objects, known as a feature allocation. The generative process for the tree is defined in terms of particles (representing the objects) diffusing in some continuous space, analogously to the Dirichlet and Pitman–Yor diffusion trees (Neal, 2003b; Knowles & Ghahramani, 2011), both of which define tree structures over clusters of the particles. With the beta diffusion tree, however, multiple copies of a particle may exist and diffuse to multiple locations in the continuous space, resulting in (a random number of) possibly overlapping clusters of the objects. We demonstrate how to build a hierarchically-clustered factor analysis model with the beta diffusion tree and how to perform inference over the random tree structures with a Markov chain Monte Carlo algorithm. We conclude with several numerical experiments on missing data problems with data sets of gene expression arrays, international development statistics, and intranational socioeconomic measurements.

#### Beta diffusion trees and hierarchical feature allocations

Creighton Heaukulani, David A. Knowles, Zoubin Ghahramani, August 2014. Dept. of Engineering, University of Cambridge.

We define the beta diffusion tree, a random tree structure with a set of leaves that defines a collection of overlapping subsets of objects, known as a feature allocation. A generative process for the tree structure is defined in terms of particles (representing the objects) diffusing in some continuous space, analogously to the Dirichlet diffusion tree (Neal, 2003b), which defines a tree structure over partitions (i.e., non-overlapping subsets) of the objects. Unlike in the Dirichlet diffusion tree, multiple copies of a particle may exist and diffuse along multiple branches in the beta diffusion tree, and an object may therefore belong to multiple subsets of particles. We demonstrate how to build a hierarchically-clustered factor analysis model with the beta diffusion tree and how to perform inference over the random tree structures with a Markov chain Monte Carlo algorithm. We conclude with several numerical experiments on missing data problems with data sets of gene expression microarrays, international development statistics, and intranational socioeconomic measurements.

#### The combinatorial structure of beta negative binomial processes

Creighton Heaukulani, Daniel M. Roy, March 2014. Dept. of Engineering, University of Cambridge.

We characterize the combinatorial structure of conditionally-i.i.d. sequences of negative binomial processes with a common beta process base measure. In Bayesian nonparametric applications, such processes have served as models for unknown multisets of a measurable space. Previous work has characterized random subsets arising from conditionally-i.i.d. sequences of Bernoulli processes with a common beta process base measure. In this case, the combinatorial structure is described by the Indian buffet process. Our results give a count analogue of the Indian buffet process, which we call a negative binomial Indian buffet process. As an intermediate step toward this goal, we provide constructions for the beta negative binomial process that avoid a representation of the underlying beta process base measure.

#### A Nonparametric Bayesian Approach to Modeling Overlapping Clusters

Katherine A. Heller, Zoubin Ghahramani, 2007. (In AISTATS). Edited by Marina Meila, Xiaotong Shen. JMLR.org. JMLR Proceedings.

Although clustering data into mutually exclusive partitions has been an extremely successful approach to unsupervised learning, there are many situations in which a richer model is needed to fully represent the data. This is the case in problems where data points actually simultaneously belong to multiple, overlapping clusters. For example a particular gene may have several functions, therefore belonging to several distinct clusters of genes, and a biologist may want to discover these through unsupervised modeling of gene expression data. We present a new nonparametric Bayesian method, the Infinite Overlapping Mixture Model (IOMM), for modeling overlapping clusters. The IOMM uses exponential family distributions to model each cluster and forms an overlapping mixture by taking products of such distributions, much like products of experts (Hinton, 2002). The IOMM allows an unbounded number of clusters, and assignments of points to (multiple) clusters is modeled using an Indian Buffet Process (IBP), (Griffiths and Ghahramani, 2006). The IOMM has the desirable properties of being able to focus in on overlapping regions while maintaining the ability to model a potentially infinite number of clusters which may overlap. We derive MCMC inference algorithms for the IOMM and show that these can be used to cluster movies into multiple genres.

#### MCMC for Variationally Sparse Gaussian Processes

James Hensman, Alexander G D G Matthews, Maurizio Filippone, Zoubin Ghahramani, December 2015. (In Advances in Neural Information Processing Systems 28). Montreal, Canada.

Gaussian process (GP) models form a core part of probabilistic machine learning. Considerable research effort has been made into attacking three issues with GP models: how to compute efficiently when the number of data is large; how to approximate the posterior when the likelihood is not Gaussian and how to estimate covariance function parameter posteriors. This paper simultaneously addresses these, using a variational approximation to the posterior which is sparse in support of the function but otherwise free-form. The result is a Hybrid Monte-Carlo sampling scheme which allows for a non-Gaussian approximation over the function values and covariance parameters simultaneously, with efficient computations based on inducing-point sparse GPs. Code to replicate each experiment in this paper will be available shortly.

#### Scalable Variational Gaussian Process Classification

James Hensman, Alexander G D G Matthews, Zoubin Ghahramani, May 2015. (In 18th International Conference on Artificial Intelligence and Statistics). San Diego, California, USA.

Gaussian process classification is a popular method with a number of appealing properties. We show how to scale the model within a variational inducing point framework, out-performing the state of the art on benchmark datasets. Importantly, the variational formulation can be exploited to allow classification in problems with millions of data points, as we demonstrate in experiments.

#### Robust Multi-Class Gaussian Process Classification

Daniel Hernández-Lobato, José Miguel Hernández-Lobato, Pierre Dupont, 2011. (In Advances in Neural Information Processing Systems 25).

Abstract▼ URL

Multi-class Gaussian Process Classifiers (MGPCs) are often affected by overfitting problems when labeling errors occur far from the decision boundaries. To prevent this, we investigate a robust MGPC (RMGPC) which considers labeling errors independently of their distance to the decision boundaries. Expectation propagation is used for approximate inference. Experiments with several datasets in which noise is injected in the labels illustrate the benefits of RMGPC. This method performs better than other Gaussian process alternatives based on considering latent Gaussian noise or heavy-tailed processes. When no noise is injected in the labels, RMGPC still performs equal or better than the other methods. Finally, we show how RMGPC can be used for successfully identifying data instances which are difficult to classify correctly in practice.

#### Gaussian Process Conditional Copulas with Applications to Financial Time Series

José Miguel Hernández-Lobato, James Robert Lloyd, Daniel Hernández-Lobato, December 2013. (In Advances in Neural Information Processing Systems 27). Lake Tahoe, California, USA.

Abstract▼ URL

The estimation of dependencies between multiple variables is a central problem in the analysis of financial time series. A common approach is to express these dependencies in terms of a copula function. Typically the copula function is assumed to be constant but this may be inaccurate when there are covariates that could have a large influence on the dependence structure of the data. To account for this, a Bayesian framework for the estimation of conditional copulas is proposed. In this framework the parameters of a copula are non-linearly related to some arbitrary conditioning variables. We evaluate the ability of our method to predict time-varying dependencies on several equities and currencies and observe consistent performance gains compared to static copula models and other time-varying copula methods.

#### Optimally-Weighted Herding is Bayesian Quadrature

Ferenc Huszár, David Duvenaud, July 2012. (In 28th Conference on Uncertainty in Artificial Intelligence). Catalina Island, California.

Abstract▼ URL

Herding and kernel herding are deterministic methods of choosing samples which summarise a probability distribution. A related task is choosing samples for estimating integrals using Bayesian quadrature. We show that the criterion minimised when selecting samples in kernel herding is equivalent to the posterior variance in Bayesian quadrature. We then show that sequential Bayesian quadrature can be viewed as a weighted version of kernel herding which achieves performance superior to any other weighted herding method. We demonstrate empirically a rate of convergence faster than O(1/N). Our results also imply an upper bound on the empirical error of the Bayesian quadrature estimate.
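
The weighted estimate at the heart of the equivalence is simple to compute. Below is a minimal 1-D sketch, assuming an RBF kernel and a standard normal target measure so the kernel mean has a closed form; both choices are ours, not prescribed by the paper.

```python
import numpy as np

def bq_weights(x, ell=1.0, s=1.0, jitter=1e-10):
    """Bayesian-quadrature weights for integrating against N(0, s^2)
    with kernel k(x, x') = exp(-(x - x')^2 / (2 ell^2)).
    The kernel mean z_n = E[k(X, x_n)] is available in closed form."""
    K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / ell ** 2)
    z = np.sqrt(ell ** 2 / (ell ** 2 + s ** 2)) * np.exp(
        -0.5 * x ** 2 / (ell ** 2 + s ** 2))
    return np.linalg.solve(K + jitter * np.eye(len(x)), z)

rng = np.random.default_rng(1)
x = rng.normal(size=30)      # herding would instead choose these points greedily
w = bq_weights(x)
estimate = w @ np.cos(x)     # BQ estimate of E[cos(X)], X ~ N(0, 1)
exact = np.exp(-0.5)         # true value of E[cos(X)]
```

Kernel herding corresponds to using uniform weights 1/N at greedily chosen points; the optimally weighted version above is what the paper identifies with sequential Bayesian quadrature.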

#### Warped Mixtures for Nonparametric Cluster Shapes

Tomoharu Iwata, David Duvenaud, Zoubin Ghahramani, July 2013. (In 29th Conference on Uncertainty in Artificial Intelligence). Bellevue, Washington.

Abstract▼ URL

A mixture of Gaussians fit to a single curved or heavy-tailed cluster will report that the data contains many clusters. To produce more appropriate clusterings, we introduce a model which warps a latent mixture of Gaussians to produce nonparametric cluster shapes. The possibly low-dimensional latent mixture model allows us to summarize the properties of the high-dimensional clusters (or density manifolds) describing the data. The number of manifolds, as well as the shape and dimension of each manifold, is automatically inferred. We derive a simple inference scheme for this model which analytically integrates out both the mixture parameters and the warping function. We show that our model is effective for density estimation, performs better than infinite Gaussian mixture models at recovering the true number of clusters, and produces interpretable summaries of high-dimensional datasets.
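
The generative picture is: draw points from a latent Gaussian mixture, then pass them through a smooth nonlinear warping. A toy sketch, using a random RBF expansion as a stand-in for a GP-distributed warping function:

```python
import numpy as np

rng = np.random.default_rng(2)

# Latent mixture of two Gaussians in 2-D.
means = np.array([[-2.0, 0.0], [2.0, 0.0]])
comp = rng.integers(0, 2, size=500)
latent = means[comp] + 0.3 * rng.normal(size=(500, 2))

# Stand-in for a GP draw: a random smooth warping built from RBF features.
centres = rng.normal(scale=2.0, size=(20, 2))
weights = rng.normal(scale=0.5, size=(20, 2))

def warp(z):
    phi = np.exp(-0.5 * ((z[:, None, :] - centres[None]) ** 2).sum(-1))
    return z + phi @ weights   # identity plus a smooth distortion

observed = warp(latent)  # clusters now have curved, non-Gaussian shapes
```

Inference in the paper runs in the other direction, integrating out the warping analytically; the sketch only shows why a plain Gaussian mixture fit in the observed space would over-count clusters.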

#### Unsupervised Many-to-Many Object Matching for Relational Data

Tomoharu Iwata, James Robert Lloyd, Zoubin Ghahramani, 2015. (IEEE Transactions on Pattern Analysis and Machine Intelligence).

Abstract▼ URL

We propose a method for unsupervised many-to-many object matching from multiple networks, which is the task of finding correspondences between groups of nodes in different networks. For example, the proposed method can discover shared word groups from multi-lingual document-word networks without cross-language alignment information. We assume that multiple networks share groups, and each group has its own interaction pattern with other groups. Using infinite relational models with this assumption, objects in different networks are clustered into common groups depending on their interaction patterns, discovering a matching. The effectiveness of the proposed method is experimentally demonstrated by using synthetic and real relational data sets, which include applications to cross-domain recommendation without shared user/item identifiers and multi-lingual word clustering.

#### Message Passing Algorithms for the Dirichlet Diffusion Tree

David A. Knowles, Jurgen Van Gael, Zoubin Ghahramani, 2011. (In 28th International Conference on Machine Learning).

Abstract▼ URL

We demonstrate efficient approximate inference for the Dirichlet Diffusion Tree (Neal, 2003), a Bayesian nonparametric prior over tree structures. Although DDTs provide a powerful and elegant approach for modeling hierarchies, they haven’t seen much use to date. One problem is the computational cost of MCMC inference. We provide the first deterministic approximate inference methods for DDT models and show excellent performance compared to the MCMC alternative. We present message passing algorithms to approximate the Bayesian model evidence for a specific tree. This is used to drive sequential tree building and greedy search to find optimal tree structures, corresponding to hierarchical clusterings of the data. We demonstrate appropriate observation models for continuous and binary data. The empirical performance of our method is very close to the computationally expensive MCMC alternative on a density estimation problem, and significantly outperforms kernel density estimators.

**Comment:** web site

#### Infinite Sparse Factor Analysis and Infinite Independent Components Analysis

David Knowles, Zoubin Ghahramani, September 2007. (In 7th International Conference on Independent Component Analysis and Signal Separation). London, UK. Springer. **DOI**: 10.1007/978-3-540-74494-8_48.

Abstract▼ URL

A nonparametric Bayesian extension of Independent Components Analysis (ICA) is proposed where observed data Y is modelled as a linear superposition, G, of a potentially infinite number of hidden sources, X. Whether a given source is active for a specific data point is specified by an infinite binary matrix, Z. The resulting sparse representation allows increased data reduction compared to standard ICA. We define a prior on Z using the Indian Buffet Process (IBP). We describe four variants of the model, with Gaussian or Laplacian priors on X and the one or two-parameter IBPs. We demonstrate Bayesian inference under these models using a Markov chain Monte Carlo (MCMC) algorithm on synthetic and gene expression data and compare to standard ICA algorithms.

#### Pitman-Yor Diffusion Trees

David A. Knowles, Zoubin Ghahramani, 2011. (In 27th Conference on Uncertainty in Artificial Intelligence).

Abstract▼ URL

We introduce the Pitman Yor Diffusion Tree (PYDT) for hierarchical clustering, a generalization of the Dirichlet Diffusion Tree (Neal, 2001) which removes the restriction to binary branching structure. The generative process is described and shown to result in an exchangeable distribution over data points. We prove some theoretical properties of the model and then present two inference methods: a collapsed MCMC sampler which allows us to model uncertainty over tree structures, and a computationally efficient greedy Bayesian EM search algorithm. Both algorithms use message passing on the tree structure. The utility of the model and algorithms is demonstrated on synthetic and real world data, both continuous and binary.

**Comment:** web site

#### Nonparametric Bayesian Sparse Factor Models with application to Gene Expression modelling.

David A. Knowles, Zoubin Ghahramani, 2011. (Annals of Applied Statistics).

Abstract▼ URL

A nonparametric Bayesian extension of Factor Analysis (FA) is proposed where observed data Y is modeled as a linear superposition, G, of a potentially infinite number of hidden factors, X. The Indian Buffet Process (IBP) is used as a prior on G to incorporate sparsity and to allow the number of latent features to be inferred. The model’s utility for modeling gene expression data is investigated using randomly generated data sets based on a known sparse connectivity matrix for E. coli, and on three biological data sets of increasing complexity.

#### Non-conjugate Variational Message Passing for Multinomial and Binary Regression

David A. Knowles, Thomas P. Minka, 2011. (In Advances in Neural Information Processing Systems 25).

Abstract▼ URL

Variational Message Passing (VMP) is an algorithmic implementation of the Variational Bayes (VB) method which applies only in the special case of conjugate exponential family models. We propose an extension to VMP, which we refer to as Non-conjugate Variational Message Passing (NCVMP) which aims to alleviate this restriction while maintaining modularity, allowing choice in how expectations are calculated, and integrating into an existing message-passing framework: Infer.NET. We demonstrate NCVMP on logistic binary and multinomial regression. In the multinomial case we introduce a novel variational bound for the softmax factor which is tighter than other commonly used bounds whilst maintaining computational tractability.

**Comment:** web site supplementary

#### Approximate inference for Fully Bayesian Gaussian process Regression

Vidhi Lalchand, Carl Edward Rasmussen, 2020. (In 2nd Symposium on Advances in Approximate Bayesian Inference).

Abstract▼ URL

Learning in Gaussian Process models occurs through the adaptation of hyperparameters of the mean and the covariance function. The classical approach entails maximizing the marginal likelihood yielding fixed point estimates (an approach called Type II maximum likelihood or ML-II). An alternative learning procedure is to infer the posterior over hyperparameters in a hierarchical specification of GPs we call Fully Bayesian Gaussian Process Regression (GPR). This work considers two approximation schemes for the intractable hyperparameter posterior: 1) Hamiltonian Monte Carlo (HMC) yielding a sampling-based approximation and 2) Variational Inference (VI) where the posterior over hyperparameters is approximated by a factorized Gaussian (mean-field) or a full-rank Gaussian accounting for correlations between hyperparameters. We analyse the predictive performance for fully Bayesian GPR on a range of benchmark data sets.
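
As a concrete picture of the sampling route, here is a minimal sketch that targets the hyperparameter posterior of an RBF-kernel GP under a flat prior, using a random-walk Metropolis sampler as a simple stand-in for the HMC scheme considered in the paper:

```python
import numpy as np

def log_marginal(x, y, log_ell, log_sf, log_sn):
    """GP log marginal likelihood with an RBF kernel, in log-hyperparameters."""
    ell, sf2, sn2 = np.exp(log_ell), np.exp(2 * log_sf), np.exp(2 * log_sn)
    K = sf2 * np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / ell ** 2)
    K += sn2 * np.eye(len(x))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha - np.log(np.diag(L)).sum()
            - 0.5 * len(x) * np.log(2 * np.pi))

rng = np.random.default_rng(3)
x = np.linspace(0, 5, 40)
y = np.sin(2 * x) + 0.2 * rng.normal(size=40)

theta = np.zeros(3)          # (log ell, log sf, log sn); flat prior assumed
lp = log_marginal(x, y, *theta)
samples = []
for _ in range(2000):
    prop = theta + 0.1 * rng.normal(size=3)      # random-walk proposal
    lp_prop = log_marginal(x, y, *prop)
    if np.log(rng.random()) < lp_prop - lp:      # Metropolis accept step
        theta, lp = prop, lp_prop
    samples.append(theta)
```

ML-II would instead optimise `log_marginal` to a single point estimate; averaging predictions over `samples` is what the fully Bayesian treatment buys.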

#### Kernel Learning for Explainable Climate Science

Vidhi Lalchand, Kenza Tazi, Talay M Cheema, Richard E Turner, Scott Hosking, 2022. (In 16th Bayesian Modelling Applications Workshop at UAI, 2022).

Abstract▼ URL

The Upper Indus Basin, Himalayas provides water for 270 million people and countless ecosystems. However, precipitation, a key component to hydrological modelling, is poorly understood in this area. A key challenge surrounding this uncertainty comes from the complex spatio-temporal distribution of precipitation across the basin. In this work we propose Gaussian processes with structured non-stationary kernels to model precipitation patterns in the UIB. Previous attempts to quantify or model precipitation in the Hindu Kush Karakoram Himalayan region have often been qualitative or include crude assumptions and simplifications which cannot be resolved at lower resolutions. This body of research also provides little to no error propagation. We account for the spatial variation in precipitation with a non-stationary Gibbs kernel parameterised with an input-dependent lengthscale. This allows the posterior function samples to adapt to the varying precipitation patterns inherent in the distinct underlying topography of the Indus region. The input-dependent lengthscale is governed by a latent Gaussian process with a stationary squared-exponential kernel to allow the function-level hyperparameters to vary smoothly. In ablation experiments we motivate each component of the proposed kernel by demonstrating its ability to model the spatial covariance, temporal structure and joint spatio-temporal reconstruction. We benchmark our model against a stationary Gaussian process and a deep Gaussian process.
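
The Gibbs kernel at the heart of the model has a simple closed form. A sketch in 1-D, with an arbitrary toy lengthscale function standing in for the latent-GP-governed lengthscale used in the paper:

```python
import numpy as np

def gibbs_kernel(x1, x2, lengthscale):
    """Non-stationary Gibbs kernel in 1-D with an input-dependent
    lengthscale function ell(x) (Gibbs, 1997):
    k(x, x') = sqrt(2 ell(x) ell(x') / (ell(x)^2 + ell(x')^2))
               * exp(-(x - x')^2 / (ell(x)^2 + ell(x')^2))."""
    l1 = lengthscale(x1)[:, None]
    l2 = lengthscale(x2)[None, :]
    sq = l1 ** 2 + l2 ** 2
    return np.sqrt(2 * l1 * l2 / sq) * np.exp(-(x1[:, None] - x2[None, :]) ** 2 / sq)

# Toy lengthscale: short near the origin (rapidly varying), long far away.
ell = lambda x: 0.2 + 0.5 * np.abs(x)
x = np.linspace(-3, 3, 100)
K = gibbs_kernel(x, x, ell)   # in the paper, ell is itself a latent GP draw
```

With a constant `ell` the expression reduces to the ordinary squared-exponential kernel, which is why it serves as a natural non-stationary generalisation.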

#### Learning-based Nonlinear Model Predictive Control

Daniel Limon, Jan-Peter Calliess, Jan Maciejowski, July 2017. (In IFAC 2017 World Congress). Toulouse, France. **DOI**: 10.1016/j.ifacol.2017.08.1050.

Abstract▼

This paper presents stabilizing Model Predictive Controllers (MPC) in which prediction models are inferred from experimental data of the inputs and outputs of the plant. Using a nonparametric machine learning technique called LACKI, the estimated (possibly nonlinear) model function is provided together with an estimate of its Hölder constant. Based on these, a number of predictive controllers with stability guaranteed by design are proposed. Firstly, the case when the prediction model is estimated off-line is considered, and robust stability and recursive feasibility are ensured by using tightened constraints in the optimisation problem. This controller is then extended to the more interesting and complex case: the online learning of the model, where new data collected from feedback are added to enhance the prediction model. An online learning MPC based on a double sequence of predictions is proposed, and its stability is proved. These controllers are illustrated by simulation.
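
LACKI belongs to the family of Lipschitz (or "kinky") interpolation rules. A minimal sketch of the underlying predictor follows, with a naive noise-free constant estimate standing in for LACKI's hedged estimation of the Hölder/Lipschitz constant:

```python
import numpy as np

def lipschitz_predict(Xtr, ytr, xq, L):
    """Kinky-inference prediction at query xq: the midpoint of the tightest
    upper and lower envelopes consistent with Lipschitz constant L."""
    d = np.abs(Xtr - xq)               # 1-D inputs for simplicity
    upper = np.min(ytr + L * d)        # lowest ceiling implied by the data
    lower = np.max(ytr - L * d)        # highest floor implied by the data
    return 0.5 * (upper + lower)

def estimate_L(Xtr, ytr):
    """Smallest constant consistent with the sample (a hypothetical,
    noise-free stand-in for LACKI's hedged estimate)."""
    dy = np.abs(ytr[:, None] - ytr[None, :])
    dx = np.abs(Xtr[:, None] - Xtr[None, :]) + np.eye(len(Xtr))  # avoid /0
    return np.max(dy / dx)

Xtr = np.linspace(0, 1, 20)
ytr = np.sin(4 * Xtr)
pred = lipschitz_predict(Xtr, ytr, 0.37, estimate_L(Xtr, ytr))
```

The gap between `upper` and `lower` bounds the prediction error, which is what makes such estimates usable inside an MPC loop with tightened constraints.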

#### Representation, learning, description and criticism of probabilistic models with applications to networks, functions and relational data

James Robert Lloyd, 2015. University of Cambridge, Department of Engineering, Cambridge, UK.

Abstract▼ URL

This thesis makes contributions to a variety of aspects of probabilistic inference. When performing probabilistic inference, one must first represent one’s beliefs with a probability distribution. Specifying the details of a probability distribution can be a difficult task in many situations, but when expressing beliefs about complex data structures it may not even be apparent what form such a distribution should take. This thesis starts by demonstrating how representation theorems due to Aldous, Hoover and Kallenberg can be used to specify appropriate models for data in the form of networks. These theorems are then extended in order to reveal appropriate probability distributions for arbitrary relational data or databases. A simpler data structure to specify probability distributions for is that of functions; many probability distributions for functions have been used for centuries. We demonstrate that many of these distributions can be expressed in a common language of Gaussian process kernels constructed from a few base elements and operators. The structure of this language allows for the effective automatic construction of probabilistic models for functions. Furthermore, the formal mathematical language of kernels can be mapped neatly onto natural language allowing for automatic descriptions of the automatically constructed models. By further automating the construction of statistical models, the need to be able to effectively check or criticise these models becomes greater. This thesis demonstrates how kernel two sample tests can be used to demonstrate where a probabilistic model most disagrees with data allowing for targeted improvements to the model. In proposing a new method of model criticism this thesis also briefly discusses the philosophy of model criticism within the context of probabilistic inference.

#### Automatic Construction and Natural-Language Description of Nonparametric Regression Models

James Robert Lloyd, David Duvenaud, Roger Grosse, Joshua B. Tenenbaum, Zoubin Ghahramani, July 2014. (In Association for the Advancement of Artificial Intelligence (AAAI)).

Abstract▼ URL

This paper presents the beginnings of an automatic statistician, focusing on regression problems. Our system explores an open-ended space of statistical models to discover a good explanation of a data set, and then produces a detailed report with figures and natural-language text. Our approach treats unknown regression functions nonparametrically using Gaussian processes, which has two important consequences. First, Gaussian processes can model functions in terms of high-level properties (e.g. smoothness, trends, periodicity, changepoints). Taken together with the compositional structure of our language of models this allows us to automatically describe functions in simple terms. Second, the use of flexible nonparametric models and a rich language for composing them in an open-ended manner also results in state-of-the-art extrapolation performance evaluated over 13 real time series data sets from various domains.
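
The compositional language of kernels can be illustrated in a few lines: base kernels are combined with `+` and `*`, and each composite expression maps to a short natural-language gloss. A toy sketch (class names and parameterisations are ours):

```python
import numpy as np

class Kernel:
    def __add__(self, other): return Combo(self, other, np.add, '+')
    def __mul__(self, other): return Combo(self, other, np.multiply, '*')

class Combo(Kernel):
    def __init__(self, a, b, op, sym): self.a, self.b, self.op, self.sym = a, b, op, sym
    def __call__(self, x, y): return self.op(self.a(x, y), self.b(x, y))
    def __repr__(self): return f'({self.a!r} {self.sym} {self.b!r})'

class SE(Kernel):                      # smooth local variation
    def __init__(self, ell): self.ell = ell
    def __call__(self, x, y): return np.exp(-0.5 * (x[:, None] - y[None, :]) ** 2 / self.ell ** 2)
    def __repr__(self): return 'SE'

class Per(Kernel):                     # exactly periodic structure
    def __init__(self, p, ell): self.p, self.ell = p, ell
    def __call__(self, x, y):
        return np.exp(-2 * np.sin(np.pi * np.abs(x[:, None] - y[None, :]) / self.p) ** 2 / self.ell ** 2)
    def __repr__(self): return 'Per'

class Lin(Kernel):                     # linear trend
    def __call__(self, x, y): return x[:, None] * y[None, :]
    def __repr__(self): return 'Lin'

# Reads as: "a linear trend plus an approximately periodic component".
k = Lin() + SE(ell=2.0) * Per(p=1.0, ell=1.0)
K = k(np.linspace(0, 5, 50), np.linspace(0, 5, 50))
print(k)   # (Lin + (SE * Per))
```

The system searches over such expressions and scores each by marginal likelihood; the same expression tree then drives the natural-language description in the report.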

#### Statistical Model Criticism using Kernel Two Sample Tests

James Robert Lloyd, Zoubin Ghahramani, December 2015. (In Advances in Neural Information Processing Systems 28). Montreal, Canada.

Abstract▼ URL

We propose an exploratory approach to statistical model criticism using maximum mean discrepancy (MMD) two sample tests. Typical approaches to model criticism require a practitioner to select a statistic by which to measure discrepancies between data and a statistical model. MMD two sample tests are instead constructed as an analytic maximisation over a large space of possible statistics and therefore automatically select the statistic which most shows any discrepancy. We demonstrate on synthetic data that the selected statistic, called the witness function, can be used to identify where a statistical model most misrepresents the data it was trained on. We then apply the procedure to real data where the models being assessed are restricted Boltzmann machines, deep belief networks and Gaussian process regression and demonstrate the ways in which these models fail to capture the properties of the data they are trained on.
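
The witness function has a particularly simple empirical form: the difference between the average kernel similarity to the data and to samples from the model. A sketch, assuming an RBF kernel:

```python
import numpy as np

def witness(x, data, model_samples, ell=1.0):
    """MMD witness function with an RBF kernel: positive where the data
    put more mass than the model, negative where the model puts more."""
    k = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)
    return k(x, data).mean(axis=1) - k(x, model_samples).mean(axis=1)

rng = np.random.default_rng(4)
data = rng.standard_t(df=3, size=500)    # heavy-tailed "real" data
model = rng.normal(size=500)             # Gaussian model being criticised
grid = np.linspace(-6, 6, 200)
w = witness(grid, data, model)           # peaks in the tails reveal the misfit
```

Plotting `w` over `grid` localises the discrepancy: here it would be largest in the tails, pointing at the Gaussian model's inability to capture heavy tails.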

#### Random function priors for exchangeable arrays with applications to graphs and relational data

James Robert Lloyd, Peter Orbanz, Zoubin Ghahramani, Daniel M. Roy, December 2012. (In Advances in Neural Information Processing Systems 26). Lake Tahoe, California, USA.

Abstract▼ URL

A fundamental problem in the analysis of structured relational data like graphs, networks, databases, and matrices is to extract a summary of the common structure underlying relations between individual entities. Relational data are typically encoded in the form of arrays; invariance to the ordering of rows and columns corresponds to exchangeable arrays. Results in probability theory due to Aldous, Hoover and Kallenberg show that exchangeable arrays can be represented in terms of a random measurable function which constitutes the natural model parameter in a Bayesian model. We obtain a flexible yet simple Bayesian nonparametric model by placing a Gaussian process prior on the parameter function. Efficient inference utilises elliptical slice sampling combined with a random sparse approximation to the Gaussian process. We demonstrate applications of the model to network data and clarify its relation to models in the literature, several of which emerge as special cases.

#### General Bayesian inference schemes in infinite mixture models

Maria Lomeli, 2017. University College London, Gatsby Unit, London, UK.

Abstract▼ URL

Bayesian statistical models allow us to formalise our knowledge about the world and reason about our uncertainty, but there is a need for better procedures to accurately encode its complexity. One way to do so is through compositional models, which are formed by combining blocks consisting of simpler models. One can increase the complexity of the compositional model by either stacking more blocks or by using a not-so-simple model as a building block. This thesis is an example of the latter. A first aim is to expand the choice of Bayesian nonparametric (BNP) blocks for constructing tractable compositional models. So far, most of the models that have a Bayesian nonparametric component use a Dirichlet Process or a Pitman-Yor process because of the availability of tractable and compact representations. This thesis shows how to overcome certain intractabilities in order to obtain analogous compact representations for the class of Poisson-Kingman priors which includes the Dirichlet and Pitman-Yor processes. A major impediment to the widespread use of Bayesian nonparametric building blocks is that inference is often costly, intractable or difficult to carry out. This is an active research area since dealing with the model’s infinite dimensional component forbids the direct use of standard simulation-based methods. The main contribution of this thesis is a variety of inference schemes that tackle this problem: Markov chain Monte Carlo and Sequential Monte Carlo methods, which are exact inference schemes since they target the true posterior. The contributions of this thesis, in a larger context, provide general purpose exact inference schemes in the flavour of probabilistic programming: the user is able to choose from a variety of models, focusing only on the modelling part. Indeed, if the wide enough class of Poisson-Kingman priors is used as one of our blocks, this objective is achieved.

#### A hybrid sampler for Poisson-Kingman mixture models

Maria Lomeli, Stefano Favaro, Yee Whye Teh, December 2015. (In Advances in Neural Information Processing Systems 28). Montreal, Canada.

Abstract▼ URL

This paper concerns the introduction of a new Markov Chain Monte Carlo scheme for posterior sampling in Bayesian nonparametric mixture models with priors that belong to the general Poisson-Kingman class. We present a novel and compact way of representing the infinite dimensional component of the model such that while explicitly representing this infinite component it has less memory and storage requirements than previous MCMC schemes. We describe comparative simulation results demonstrating the efficacy of the proposed MCMC algorithm against existing marginal and conditional MCMC samplers.

#### A marginal sampler for sigma-Stable Poisson-Kingman mixture models

Maria Lomeli, Stefano Favaro, Yee Whye Teh, 2017. (Journal of Computational and Graphical Statistics).

Abstract▼ URL

We investigate the class of sigma-stable Poisson-Kingman random probability measures (RPMs) in the context of Bayesian nonparametric mixture modeling. This is a large class of discrete RPMs, which encompasses most of the popular discrete RPMs used in Bayesian nonparametrics, such as the Dirichlet process, Pitman-Yor process, the normalized inverse Gaussian process, and the normalized generalized Gamma process. We show how certain sampling properties and marginal characterizations of sigma-stable Poisson-Kingman RPMs can be usefully exploited for devising a Markov chain Monte Carlo (MCMC) algorithm for performing posterior inference with a Bayesian nonparametric mixture model. Specifically, we introduce a novel and efficient MCMC sampling scheme in an augmented space that has a small number of auxiliary variables per iteration. We apply our sampling scheme to density estimation and clustering tasks with unidimensional and multidimensional datasets, and compare it against competing MCMC sampling schemes. Supplementary materials for this article are available online.

#### Antithetic and Monte Carlo kernel estimators for partial rankings

Maria Lomeli, Mark Rowland, Arthur Gretton, Zoubin Ghahramani, 2018. (arXiv preprint arXiv:1807.00400).

Abstract▼ URL

In the modern age, rankings data is ubiquitous and it is useful for a variety of applications such as recommender systems, multi-object tracking and preference learning. However, most rankings data encountered in the real world is incomplete, which prevents the direct application of existing modelling tools for complete rankings. Our contribution is a novel way to extend kernel methods for complete rankings to partial rankings, via consistent Monte Carlo estimators for Gram matrices: matrices of kernel values between pairs of observations. We also present a novel variance reduction scheme based on an antithetic variate construction between permutations to obtain an improved estimator for the Mallows kernel. The corresponding antithetic kernel estimator has lower variance and we demonstrate empirically that it has a better performance in a variety of Machine Learning tasks. Both kernel estimators are based on extending kernel mean embeddings to the embedding of a set of full rankings consistent with an observed partial ranking. They form a computationally tractable alternative to previous approaches for partial rankings data. An overview of the existing kernels and metrics for permutations is also provided.
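
The basic Monte Carlo estimator is straightforward: average a kernel for full rankings (here the Mallows kernel with Kendall-tau distance) over completions sampled uniformly from the full rankings consistent with each partial ranking. A sketch without the paper's antithetic variance reduction:

```python
import numpy as np
from itertools import combinations

def kendall_distance(p, q):
    """Number of discordant pairs between two rank vectors."""
    n = len(p)
    return sum((p[i] < p[j]) != (q[i] < q[j]) for i, j in combinations(range(n), 2))

def mallows(p, q, lam=0.5):
    return np.exp(-lam * kendall_distance(p, q))

def mc_kernel(partial_a, partial_b, n_items, rng, n_mc=100):
    """Monte Carlo Gram-matrix entry between two partial rankings, where a
    partial ranking is an ordered tuple of some item indices."""
    def complete(partial):
        # Uniform full ranking consistent with the partial order: draw a
        # uniform permutation, then place the observed items, in their
        # required order, into the slots they happened to occupy.
        order = list(rng.permutation(n_items))
        slots = sorted(pos for pos, item in enumerate(order) if item in partial)
        for slot, item in zip(slots, partial):
            order[slot] = item
        ranks = np.empty(n_items, dtype=int)
        ranks[order] = np.arange(n_items)
        return ranks
    return np.mean([mallows(complete(partial_a), complete(partial_b))
                    for _ in range(n_mc)])

rng = np.random.default_rng(5)
k_hat = mc_kernel((0, 2), (2, 1), n_items=5, rng=rng)
```

The antithetic construction in the paper pairs each sampled completion with a carefully reversed one, reducing the variance of exactly this kind of estimator.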

#### Gaussian Process Vine Copulas for Multivariate Dependence

David Lopez-Paz, José Miguel Hernández-Lobato, Zoubin Ghahramani, June 2013. (In 30th International Conference on Machine Learning). Atlanta, Georgia, USA.

Abstract▼ URL

Copulas allow us to learn marginal distributions separately from the multivariate dependence structure (copula) that links them together into a density function. Vine factorizations ease the learning of high-dimensional copulas by constructing a hierarchy of conditional bivariate copulas. However, to simplify inference, it is common to assume that each of these conditional bivariate copulas is independent from its conditioning variables. In this paper, we relax this assumption by discovering the latent functions that specify the shape of a conditional copula given its conditioning variables. We learn these functions by following a Bayesian approach based on sparse Gaussian processes with expectation propagation for scalable, approximate inference. Experiments on real-world datasets show that, when modeling all conditional dependencies, we obtain better estimates of the underlying copula of the data.

#### Classification using log Gaussian Cox processes

Alexander G. D. G. Matthews, Zoubin Ghahramani, 2014. (arXiv preprint arXiv:1405.4141).

Abstract▼ URL

McCullagh and Yang (2006) suggest a family of classification algorithms based on Cox processes. We further investigate the log Gaussian variant which has a number of appealing properties. Conditioned on the covariates, the distribution over labels is given by a type of conditional Markov random field. In the supervised case, computation of the predictive probability of a single test point scales linearly with the number of training points and the multiclass generalization is straightforward. We show new links between the supervised method and classical nonparametric methods. We give a detailed analysis of the pairwise graph representable Markov random field, which we use to extend the model to semi-supervised learning problems, and propose an inference method based on graph min-cuts. We give the first experimental analysis on supervised and semi-supervised datasets and show good empirical performance.

#### On Sparse Variational methods and the Kullback-Leibler divergence between stochastic processes

Alexander G D G Matthews, James Hensman, Richard E. Turner, Zoubin Ghahramani, May 2016. (In 19th International Conference on Artificial Intelligence and Statistics). Cadiz, Spain.

Abstract▼ URL

The variational framework for learning inducing variables (Titsias, 2009a) has had a large impact on the Gaussian process literature. The framework may be interpreted as minimizing a rigorously defined Kullback-Leibler divergence between the approximating and posterior processes. To our knowledge this connection has thus far gone unremarked in the literature. In this paper we give a substantial generalization of the literature on this topic. We give a new proof of the result for infinite index sets which allows inducing points that are not data points and likelihoods that depend on all function values. We then discuss augmented index sets and show that, contrary to previous works, marginal consistency of augmentation is not enough to guarantee consistency of variational inference with the original model. We then characterize an extra condition where such a guarantee is obtainable. Finally we show how our framework sheds light on interdomain sparse approximations and sparse approximations for Cox processes.

#### Modelling dyadic data with binary latent factors

E. Meeds, Z. Ghahramani, R. Neal, S.T. Roweis, September 2007. (In Advances in Neural Information Processing Systems 19). Edited by B. Schölkopf, J. Platt, T. Hofmann. Cambridge, MA, USA. The MIT Press. Bradford Books. **Note**: The online contents give pages 1002–1009; the PDF contents give pages 977–984.

Abstract▼ URL

We introduce binary matrix factorization, a novel model for unsupervised matrix decomposition. The decomposition is learned by fitting a non-parametric Bayesian probabilistic model with binary latent variables to a matrix of dyadic data. Unlike bi-clustering models, which assign each row or column to a single cluster based on a categorical hidden feature, our binary feature model reflects the prior belief that items and attributes can be associated with more than one latent cluster at a time. We provide simple learning and inference rules for this new model and show how to extend it to an infinite model in which the number of features is not a priori fixed but is allowed to grow with the size of the data.

#### A Nonparametric Bayesian Model for Multiple Clustering with Overlapping Feature Views

Donglin Niu, Jennifer G. Dy, Z. Ghahramani, 2012. (In 15th International Conference on Artificial Intelligence and Statistics).

Abstract▼ URL

Most clustering algorithms produce a single clustering solution. This is inadequate for many data sets that are multi-faceted and can be grouped and interpreted in many different ways. Moreover, for high-dimensional data, different features may be relevant or irrelevant to each clustering solution, suggesting the need for feature selection in clustering. Features relevant to one clustering interpretation may be different from the ones relevant for an alternative interpretation or view of the data. In this paper, we introduce a probabilistic nonparametric Bayesian model that can discover multiple clustering solutions from data and the feature subsets that are relevant for the clusters in each view. In our model, the features in different views may be shared and therefore the sets of relevant features are allowed to overlap. We model feature relevance to each view using an Indian Buffet Process and the cluster membership in each view using a Chinese Restaurant Process. We provide an inference approach to learn the latent parameters corresponding to this multiple partitioning problem. Our model not only learns the features and clusters in each view but also automatically learns the number of clusters, number of views and number of features in each view.

#### Construction of Nonparametric Bayesian Models from Parametric Bayes Equations

Peter Orbanz, 2009. (In Advances in Neural Information Processing Systems 22). Edited by Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, A. Culotta. The MIT Press.

Abstract▼ URL

We consider the general problem of constructing nonparametric Bayesian models on infinite-dimensional random objects, such as functions, infinite graphs or infinite permutations. The problem has generated much interest in machine learning, where it is treated heuristically, but has not been studied in full generality in nonparametric Bayesian statistics, which tends to focus on models over probability distributions. Our approach applies a standard tool of stochastic process theory, the construction of stochastic processes from their finite-dimensional marginal distributions. The main contribution of the paper is a generalization of the classic Kolmogorov extension theorem to conditional probabilities. This extension allows a rigorous construction of nonparametric Bayesian models from systems of finite-dimensional, parametric Bayes equations. Using this approach, we show (i) how existence of a conjugate posterior for the nonparametric model can be guaranteed by choosing conjugate finite-dimensional models in the construction, (ii) how the mapping to the posterior parameters of the nonparametric model can be explicitly determined, and (iii) that the construction of conjugate models in essence requires the finite-dimensional models to be in the exponential family. As an application of our constructive framework, we derive a model on infinite permutations, the nonparametric Bayesian analogue of a model recently proposed for the analysis of rank data.

**Comment:** Supplements (proofs) and techreport version

#### An Infinite Latent Attribute Model for Network Data

Konstantina Palla, David A. Knowles, Zoubin Ghahramani, June 2012. (In 29th International Conference on Machine Learning). Edinburgh, Scotland.

Abstract▼ URL

Latent variable models for network data extract a summary of the relational structure underlying an observed network. The simplest possible models subdivide nodes of the network into clusters; the probability of a link between any two nodes then depends only on their cluster assignment. Currently available models can be classified by whether clusters are disjoint or are allowed to overlap. These models can explain a “flat” clustering structure. Hierarchical Bayesian models provide a natural approach to capture more complex dependencies. We propose a model in which objects are characterised by a latent feature vector. Each feature is itself partitioned into disjoint groups (subclusters), corresponding to a second layer of hierarchy. In experimental comparisons, the model achieves significantly improved predictive performance on social and biological link prediction tasks. The results indicate that models with a single layer hierarchy over-simplify real networks.

#### A reversible infinite HMM using normalised random measures

Konstantina Palla, David A. Knowles, Zoubin Ghahramani, June 2014. (In 31st International Conference on Machine Learning). Beijing, China.

Abstract▼ URL

We present a nonparametric prior over reversible Markov chains. We use completely random measures, specifically gamma processes, to construct a countably infinite graph with weighted edges. By enforcing symmetry to make the edges undirected we define a prior over random walks on graphs that results in a reversible Markov chain. The resulting prior over infinite transition matrices is closely related to the hierarchical Dirichlet process but enforces reversibility. A reinforcement scheme has recently been proposed with similar properties, but the de Finetti measure is not well characterised. We take the alternative approach of explicitly constructing the mixing measure, which allows more straightforward and efficient inference at the cost of no longer having a closed form predictive distribution. We use our process to construct a reversible infinite HMM which we apply to two real datasets, one from epigenomics and one from an ion channel recording.

#### A nonparametric variable clustering model

Konstantina Palla, David A. Knowles, Zoubin Ghahramani, December 2012. (In Advances in Neural Information Processing Systems 26). Lake Tahoe, California, USA.

Abstract▼ URL

Factor analysis models effectively summarise the covariance structure of high dimensional data, but the solutions are typically hard to interpret. This motivates attempting to find a disjoint partition, i.e. a simple clustering, of observed variables into highly correlated subsets. We introduce a Bayesian non-parametric approach to this problem, and demonstrate advantages over heuristic methods proposed to date. Our Dirichlet process variable clustering (DPVC) model can discover block-diagonal covariance structures in data. We evaluate our method on both synthetic and gene expression analysis problems.

#### Spectral Diffusion Processes

Angus Phillips, Thomas Seror, Michael Hutchinson, Valentin De Bortoli, Arnaud Doucet, Emile Mathieu, 2022. (In NeurIPS workshop on Score-Based Methods).

Abstract▼ URL

Score-based generative modelling (SGM) has proven to be a very effective method for modelling densities on finite-dimensional spaces. In this work we propose to extend this methodology to learn generative models over functional spaces. To do so, we represent functional data in spectral space to dissociate the stochastic part of the processes from their space-time part. Using dimensionality reduction techniques we then sample from their stochastic component using finite dimensional SGM. We demonstrate our method’s effectiveness for modelling various multimodal datasets.

#### The Supervised IBP: Neighbourhood Preserving Infinite Latent Feature Models

Novi Quadrianto, Viktoriia Sharmanska, David A. Knowles, Zoubin Ghahramani, July 2013. (In 29th Conference on Uncertainty in Artificial Intelligence). Bellevue, USA.

Abstract▼ URL

We propose a probabilistic model to infer supervised latent variables in the Hamming space from observed data. Our model allows simultaneous inference of the number of binary latent variables, and their values. The latent variables preserve neighbourhood structure of the data in a sense that objects in the same semantic concept have similar latent values, and objects in different concepts have dissimilar latent values. We formulate the supervised infinite latent variable problem based on an intuitive principle of pulling objects together if they are of the same type, and pushing them apart if they are not. We then combine this principle with a flexible Indian Buffet Process prior on the latent variables. We show that the inferred supervised latent variables can be directly used to perform a nearest neighbour search for the purpose of retrieval. We introduce a new application of dynamically extending hash codes, and show how to effectively couple the structure of the hash codes with continuously growing structure of the neighbourhood preserving infinite latent feature space.

#### The Infinite Gaussian Mixture Model

Carl Edward Rasmussen, 2000. (In Advances in Neural Information Processing Systems 12). Edited by Sara A. Solla, Todd K. Leen, Klaus-Robert Müller. The MIT Press.

Abstract▼ URL

In a Bayesian mixture model it is not necessary a priori to limit the number of components to be finite. In this paper an infinite Gaussian mixture model is presented which neatly sidesteps the difficult problem of finding the “right” number of mixture components. Inference in the model is done using an efficient parameter-free Markov Chain that relies entirely on Gibbs sampling.
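
A minimal collapsed Gibbs sampler conveys the idea: each point is reassigned to an existing cluster with probability proportional to its size times a posterior predictive density, or to a new cluster with probability proportional to the concentration parameter. The sketch below fixes the component variance for simplicity, whereas the paper also infers the variances and hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(6)
x = np.concatenate([rng.normal(-3, 0.5, 50), rng.normal(3, 0.5, 50)])
n = len(x)
alpha, s2, s02 = 1.0, 0.25, 9.0   # DP concentration, obs. variance, prior variance
z = np.zeros(n, dtype=int)        # start with all points in one cluster

def predictive(xi, members):
    """Posterior predictive p(xi | members) under a N(0, s02) prior on the mean."""
    post_var = 1.0 / (1.0 / s02 + len(members) / s2)
    post_mean = post_var * members.sum() / s2
    var = post_var + s2
    return np.exp(-0.5 * (xi - post_mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

for sweep in range(50):
    for i in range(n):
        z[i] = -1                                 # remove point i from its cluster
        labels = [k for k in np.unique(z) if k >= 0]
        weights = [(z == k).sum() * predictive(x[i], x[z == k]) for k in labels]
        weights.append(alpha * predictive(x[i], np.empty(0)))   # brand-new cluster
        weights = np.array(weights) / np.sum(weights)
        j = rng.choice(len(weights), p=weights)
        z[i] = labels[j] if j < len(labels) else max(labels, default=-1) + 1

print(len(np.unique(z)))   # typically settles near 2 clusters for this data
```

Because empty clusters simply vanish from `labels` and new ones are created on demand, the number of components is never fixed in advance.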

#### Occam's Razor

Carl Edward Rasmussen, Zoubin Ghahramani, December 2001. (In Advances in Neural Information Processing Systems 13). Edited by T. K. Leen, T. G. Dietterich, V. Tresp. Cambridge, MA, USA. The MIT Press.

Abstract▼ URL

The Bayesian paradigm apparently only sometimes gives rise to Occam’s Razor; at other times very large models perform well. We give simple examples of both kinds of behaviour. The two views are reconciled when measuring complexity of functions, rather than of the machinery used to implement them. We analyze the complexity of functions for some linear in the parameter models that are equivalent to Gaussian Processes, and always find Occam’s Razor at work.

#### Infinite Mixtures of Gaussian Process Experts

Carl Edward Rasmussen, Zoubin Ghahramani, December 2002. (In Advances in Neural Information Processing Systems 14). Edited by T. G. Dietterich, S. Becker, Z. Ghahramani. Cambridge, MA, USA. The MIT Press.

Abstract▼ URL

We present an extension to the Mixture of Experts (ME) model, where the individual experts are Gaussian Process (GP) regression models. Using an input-dependent adaptation of the Dirichlet Process, we implement a gating network for an infinite number of Experts. Inference in this model may be done efficiently using a Markov Chain relying on Gibbs sampling. The model allows the effective covariance function to vary with the inputs, and may handle large datasets — thus potentially overcoming two of the biggest hurdles with GP models. Simulations show the viability of this approach.

#### Scaling the Indian Buffet Process via Submodular Maximization

Colorado Reed, Zoubin Ghahramani, 2013. (In ICML). JMLR.org. JMLR Proceedings.

Abstract▼ URL

Inference for latent feature models is inherently difficult as the inference space grows exponentially with the size of the input data and number of latent features. In this work, we use Kurihara & Welling (2008)’s maximization-expectation framework to perform approximate MAP inference for linear-Gaussian latent feature models with an Indian Buffet Process (IBP) prior. This formulation yields a submodular function of the features that corresponds to a lower bound on the model evidence. By adding a constant to this function, we obtain a nonnegative submodular function that can be maximized via a greedy algorithm that obtains at least a one-third approximation to the optimal solution. Our inference method scales linearly with the size of the input data, and we show the efficacy of our method on the largest datasets currently analyzed using an IBP model.

#### On the computability and complexity of Bayesian reasoning

Daniel M. Roy, 2011. (In NIPS Workshop on Philosophy and Machine Learning).

Abstract▼ URL

If we consider the claim made by some cognitive scientists that the mind performs Bayesian reasoning, and if we simultaneously accept the Physical Church-Turing thesis and thus believe that the computational power of the mind is no more than that of a Turing machine, then what limitations are there to the reasoning abilities of the mind? I give an overview of joint work with Nathanael Ackerman (Harvard, Mathematics) and Cameron Freer (MIT, CSAIL) that bears on the computability and complexity of Bayesian reasoning. In particular, we prove that conditional probability is in general not computable in the presence of continuous random variables. However, in light of additional structure in the prior distribution, such as the presence of certain types of noise, or of exchangeability, conditioning is possible. These results cover most of statistical practice. At the workshop on Logic and Computational Complexity, we presented results on the computational complexity of conditioning, embedding sharp-P-complete problems in the task of computing conditional probabilities for diffuse continuous random variables. This work complements older work. For example, under cryptographic assumptions, the computational complexity of producing samples and computing probabilities was separated by Ben-David, Chor, Goldreich and Luby. In recent work, we also make use of cryptographic assumptions to show that different representations of exchangeable sequences may have vastly different complexity. However, when faced with an adversary that is computationally bounded, these different representations have the same complexity, highlighting the fact that knowledge representation and approximation play a fundamental role in the possibility and plausibility of Bayesian reasoning.

#### Determinantal Clustering Processes - A Nonparametric Bayesian Approach to Kernel Based Semi-Supervised Clustering

Amar Shah, Zoubin Ghahramani, 2013. (UAI).

Abstract▼ URL

Semi-supervised clustering is the task of clustering data points into clusters where only a fraction of the points are labelled. The true number of clusters in the data is often unknown and most models require this parameter as an input. Dirichlet process mixture models are appealing as they can infer the number of clusters from the data. However, these models do not deal with high dimensional data well and can encounter difficulties in inference. We present a novel nonparametric Bayesian kernel based method to cluster data points without the need to prespecify the number of clusters or to model complicated densities from which data points are assumed to be generated. The key insight is to use determinants of submatrices of a kernel matrix as a measure of how close together a set of points are. We explore some theoretical properties of the model and derive a natural Gibbs based algorithm with MCMC hyperparameter learning. The model is implemented on a variety of synthetic and real world data sets.
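
The key insight is easy to demonstrate: the determinant of a kernel submatrix shrinks toward zero as the corresponding points become similar (its rows become nearly linearly dependent), so it measures how close together a set of points is. A toy illustration:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(60, 2))
K = np.exp(-0.5 * ((X[:, None] - X[None]) ** 2).sum(-1))  # RBF kernel matrix

def closeness(idx):
    """det of the kernel submatrix: near 0 for tightly clustered points,
    near 1 for well-separated ones."""
    return np.linalg.det(K[np.ix_(idx, idx)])

tight = np.argsort(((X - X[0]) ** 2).sum(-1))[:5]   # 5 points nearest X[0]
spread = rng.choice(60, size=5, replace=False)       # 5 random points
print(closeness(tight), closeness(spread))           # tight set << spread set
```

In the model, cluster assignments that group near-duplicate points score small determinants, which is what drives points in the same cluster to be close without ever specifying a density.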

#### Marginalised Gaussian Processes with Nested Sampling

Fergus Simpson, Vidhi Lalchand, Carl Edward Rasmussen, 2021. (In Advances in Neural Information Processing Systems 34). Curran Associates, Inc..

Abstract▼ URL

Gaussian Process models are a rich distribution over functions with inductive biases controlled by a kernel function. Learning occurs through optimisation of the kernel hyperparameters using the marginal likelihood as the objective. This work proposes nested sampling as a means of marginalising kernel hyperparameters, because it is a technique that is well-suited to exploring complex, multi-modal distributions. We benchmark against Hamiltonian Monte Carlo on time-series and two-dimensional regression tasks, finding that a principled approach to quantifying hyperparameter uncertainty substantially improves the quality of prediction intervals.

#### Robust estimation of local genetic ancestry in admixed populations using a non-parametric Bayesian approach

Kyung-Ah Sohn, Zoubin Ghahramani, Eric P. Xing, 2012. (Genetics).

Abstract▼ URL

We present a new haplotype-based approach for inferring local genetic ancestry of individuals in an admixed population. Most existing approaches for local ancestry estimation ignore the latent genetic relatedness between ancestral populations and treat them as independent. In this paper, we exploit such information by building an inheritance model that describes both the ancestral populations and the admixed population jointly in a unified framework. Based on an assumption that the common hypothetical founder haplotypes give rise to both the ancestral and admixed population haplotypes, we employ an infinite hidden Markov model to characterize each ancestral population and further extend it to generate the admixed population. Through an effective utilization of the population structural information under a principled nonparametric Bayesian framework, the resulting model is significantly less sensitive to the choice and the amount of training data for ancestral populations than state-of-the-art algorithms. We also improve the robustness under deviation from common modeling assumptions by incorporating population-specific scale parameters that allow variable recombination rates in different populations. Our method is applicable to an admixed population from an arbitrary number of ancestral populations and also performs competitively in terms of spurious ancestry proportions under a general multi-way admixture assumption. We validate the proposed method by simulation under various admixing scenarios and present empirical analysis results on a worldwide-distributed dataset from the Human Genome Diversity Project.

**Comment:** doi: 10.1534/genetics.112.140228

#### Flexible Martingale Priors for Deep Hierarchies

Jacob Steinhardt, Zoubin Ghahramani, 2012. (In 15th International Conference on Artificial Intelligence and Statistics).

Abstract▼ URL

When building priors over trees for Bayesian hierarchical models, there is a tension between maintaining desirable theoretical properties such as infinite exchangeability and important practical properties such as the ability to increase the depth of the tree to accommodate new data. We resolve this tension by presenting a family of infinitely exchangeable priors over discrete tree structures that allows the depth of the tree to grow with the data, and then showing that our family contains all hierarchical models with certain mild symmetry properties. We also show that deep hierarchical models are in general intimately tied to a process called a martingale, and use Doob’s martingale convergence theorem to demonstrate some unexpected properties of deep hierarchies.

#### Stick-breaking Construction for the Indian Buffet Process

Yee Whye Teh, Dilan Görür, Zoubin Ghahramani, 2007. (In AISTATS). Edited by Marina Meila, Xiaotong Shen. JMLR.org. JMLR Proceedings.

Abstract▼ URL

The Indian buffet process (IBP) is a Bayesian nonparametric distribution whereby objects are modelled using an unbounded number of latent features. In this paper we derive a stick-breaking representation for the IBP. Based on this new representation, we develop slice samplers for the IBP that are efficient, easy to implement and are more generally applicable than the currently available Gibbs sampler. This representation, along with the work of Thibaux and Jordan, also illuminates interesting theoretical connections between the IBP, Chinese restaurant processes, Beta processes and Dirichlet processes.
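
The stick-breaking representation is short enough to state in code: the feature probabilities are products of Beta(α, 1) sticks, so they decay monotonically with the feature index, and each entry of Z is an independent Bernoulli draw. A truncated sketch:

```python
import numpy as np

def ibp_stick_breaking(n, alpha, K_max, rng):
    """Truncated stick-breaking draw from the IBP: mu_k = prod_{i<=k} nu_i
    with nu_i ~ Beta(alpha, 1); rows of Z are customers, columns dishes."""
    nu = rng.beta(alpha, 1.0, size=K_max)
    mu = np.cumprod(nu)                # monotonically decreasing feature weights
    Z = rng.random((n, K_max)) < mu    # z_nk ~ Bernoulli(mu_k), independently
    return Z, mu

rng = np.random.default_rng(8)
Z, mu = ibp_stick_breaking(n=10, alpha=2.0, K_max=50, rng=rng)
print(Z.sum(axis=0))   # feature usage tails off with k, as mu does
```

The decaying weights are what make slice sampling possible: only finitely many features ever exceed a random threshold, so the infinite model can be handled without a fixed truncation.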

#### Bayesian learning of sum-product networks

Martin Trapp, Robert Peharz, Hong Ge, Franz Pernkopf, Zoubin Ghahramani, December 2019. (In Advances in Neural Information Processing Systems 33). Vancouver.

Abstract▼ URL

Sum-product networks (SPNs) are flexible density estimators and have received significant attention due to their attractive inference properties. While parameter learning in SPNs is well developed, structure learning leaves something to be desired: Even though there is a plethora of SPN structure learners, most of them are somewhat ad-hoc and based on intuition rather than a clear learning principle. In this paper, we introduce a well-principled Bayesian framework for SPN structure learning. First, we decompose the problem into i) laying out a computational graph, and ii) learning the so-called scope function over the graph. The first is rather unproblematic and akin to neural network architecture validation. The second represents the effective structure of the SPN and needs to respect the usual structural constraints in SPN, i.e. completeness and decomposability. While representing and learning the scope function is somewhat involved in general, in this paper, we propose a natural parametrisation for an important and widely used special case of SPNs. These structural parameters are incorporated into a Bayesian model, such that simultaneous structure and parameter learning is cast into monolithic Bayesian posterior inference. In various experiments, our Bayesian SPNs often improve test likelihoods over greedy SPN learners. Further, since the Bayesian framework protects against overfitting, we can evaluate hyper-parameters directly on the Bayesian model score, waiving the need for a separate validation set, which is especially beneficial in low data regimes. Bayesian SPNs can be applied to heterogeneous domains and can easily be extended to nonparametric formulations. Moreover, our Bayesian approach is the first, which consistently and robustly learns SPN structures under missing data.

#### The infinite HMM for unsupervised PoS tagging

J. Van Gael, A. Vlachos, Z. Ghahramani, August 2009. (In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP)). Singapore. Association for Computational Linguistics. **ISBN**: 978-1-932432-62-6.

Abstract▼ URL

We extend previous work on fully unsupervised part-of-speech tagging. Using a non-parametric version of the HMM, called the infinite HMM (iHMM), we address the problem of choosing the number of hidden states in unsupervised Markov models for PoS tagging. We experiment with two non-parametric priors, the Dirichlet and Pitman-Yor processes, on the Wall Street Journal dataset using a parallelized implementation of an iHMM inference algorithm. We evaluate the results with a variety of clustering evaluation metrics and achieve equivalent or better performances than previously reported. Building on this promising result we evaluate the output of the unsupervised PoS tagger as a direct replacement for the output of a fully supervised PoS tagger for the task of shallow parsing and compare the two evaluations.

#### Unsupervised and constrained Dirichlet process mixture models for verb clustering

A. Vlachos, A Korhonen, Z. Ghahramani, March 2009. (In 4th Workshop on Statistical Machine Translation, EACL '09). Athens, Greece.

Abstract▼ URL

In this work, we apply Dirichlet Process Mixture Models (DPMMs) to a learning task in natural language processing (NLP): lexical-semantic verb clustering. We thoroughly evaluate a method of guiding DPMMs towards a particular clustering solution using pairwise constraints. The quantitative and qualitative evaluation performed highlights the benefits of both standard and constrained DPMMs compared to previously used approaches. In addition, it sheds light on the use of evaluation measures and their practical application.

#### Covariance Kernels for Fast Automatic Pattern Discovery and Extrapolation with Gaussian Processes

Andrew Gordon Wilson, 2014. University of Cambridge, Cambridge, UK.

Abstract▼ URL

Truly intelligent systems are capable of pattern discovery and extrapolation without human intervention. Bayesian nonparametric models, which can uniquely represent expressive prior information and detailed inductive biases, provide a distinct opportunity to develop intelligent systems, with applications in essentially any learning and prediction task. Gaussian processes are rich distributions over functions, which provide a Bayesian nonparametric approach to smoothing and interpolation. A covariance kernel determines the support and inductive biases of a Gaussian process. In this thesis, we introduce new covariance kernels to enable fast automatic pattern discovery and extrapolation with Gaussian processes. In the introductory chapter, we discuss the high level principles behind all of the models in this thesis: 1) we can typically improve the predictive performance of a model by accounting for additional structure in data; 2) to automatically discover rich structure in data, a model must have large support and the appropriate inductive biases; 3) we most need expressive models for large datasets, which typically provide more information for learning structure, and 4) we can often exploit the existing inductive biases (assumptions) or structure of a model for scalable inference, without the need for simplifying assumptions. In the context of this introduction, we then discuss, in chapter 2, Gaussian processes as kernel machines, and my views on the future of Gaussian process research. In chapter 3 we introduce the Gaussian process regression network (GPRN) framework, a multi-output Gaussian process method which scales to many output variables, and accounts for input-dependent correlations between the outputs. Underlying the GPRN is a highly expressive kernel, formed using an adaptive mixture of latent basis functions in a neural network like architecture. The GPRN is capable of discovering expressive structure in data. We use the GPRN to model the time-varying expression levels of 1000 genes, the spatially varying concentrations of several distinct heavy metals, and multivariate volatility (input dependent noise covariances) between returns on equity indices and currency exchanges, which is particularly valuable for portfolio allocation. We generalise the GPRN to an adaptive network framework, which does not depend on Gaussian processes or Bayesian nonparametrics; and we outline applications for the adaptive network in nuclear magnetic resonance (NMR) spectroscopy, ensemble learning, and change-point modelling. In chapter 4 we introduce simple closed form kernels for automatic pattern discovery and extrapolation. These spectral mixture (SM) kernels are derived by modelling the spectral density of a kernel (its Fourier transform) using a scale-location Gaussian mixture. SM kernels form a basis for all stationary covariances, and can be used as a drop-in replacement for standard kernels, as they retain simple and exact learning and inference procedures. We use the SM kernel to discover patterns and perform long range extrapolation on atmospheric CO2 trends and airline passenger data, as well as on synthetic examples. We also show that the SM kernel can be used to automatically reconstruct several standard covariances. The SM kernel and the GPRN are highly complementary; we show that using the SM kernel with adaptive basis functions in a GPRN induces an expressive prior over non-stationary kernels.
In chapter 5 we introduce GPatt, a method for fast multidimensional pattern extrapolation, particularly suited to image and movie data. Without human intervention – no hand crafting of kernel features, and no sophisticated initialisation procedures – we show that GPatt can solve large scale pattern extrapolation, inpainting and kernel discovery problems, including a problem with 383,400 training points. GPatt exploits the structure of a spectral mixture product (SMP) kernel, for fast yet exact inference procedures. We find that GPatt significantly outperforms popular alternative scalable Gaussian process methods in speed and accuracy. Moreover, we discover profound differences between each of these methods, suggesting that expressive kernels, nonparametric representations, and scalable inference which exploits existing model structure are useful in combination for modelling large scale multidimensional patterns. The models in this dissertation have proven to be scalable, with greatly enhanced predictive performance over the alternatives: the extra structure being modelled is an important part of a wide variety of real data – including problems in econometrics, gene expression, geostatistics, nuclear magnetic resonance spectroscopy, ensemble learning, multi-output regression, change point modelling, time series, multivariate volatility, image inpainting, texture extrapolation, video extrapolation, acoustic modelling, and kernel discovery.
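
For reference, the spectral construction summarised above (chapter 4) can be written in one display. By Bochner's theorem, a stationary kernel is the Fourier dual of its spectral density; modelling that density with a Q-component scale-location Gaussian mixture yields the closed-form one-dimensional SM kernel (a sketch in our own notation, not quoted from the thesis):

```latex
k(\tau) = \int S(s)\, e^{2\pi i s \tau}\, \mathrm{d}s
\quad\Longrightarrow\quad
k_{\mathrm{SM}}(\tau) = \sum_{q=1}^{Q} w_q\,
  \exp\!\left(-2\pi^{2}\tau^{2} v_q\right)\cos\!\left(2\pi\tau\mu_q\right),
```

where \(\tau = x - x'\) and \(w_q, \mu_q, v_q\) are the weight, mean, and variance of the q-th component of the spectral density.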

#### Gaussian Process Kernels for Pattern Discovery and Extrapolation

Andrew Gordon Wilson, Ryan Prescott Adams, February 18 2013. (In 30th International Conference on Machine Learning).

Abstract▼ URL

Gaussian processes are rich distributions over functions, which provide a Bayesian nonparametric approach to smoothing and interpolation. We introduce simple closed form kernels that can be used with Gaussian processes to discover patterns and enable extrapolation. These kernels are derived by modelling a spectral density – the Fourier transform of a kernel – with a Gaussian mixture. The proposed kernels support a broad class of stationary covariances, but Gaussian process inference remains simple and analytic. We demonstrate the proposed kernels by discovering patterns and performing long range extrapolation on synthetic examples, as well as atmospheric CO2 trends and airline passenger data. We also show that we can reconstruct standard covariances within our framework.

**Comment:** arXiv:1302.4245
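
The closed-form kernel is simple enough to state in a few lines of code. The following NumPy sketch is ours, not the authors' reference implementation, and the parameter names are illustrative:

```python
import numpy as np

def sm_kernel(tau, weights, means, variances):
    """One-dimensional spectral mixture kernel at distances tau = x - x'.

    The spectral density is a scale-location Gaussian mixture with the
    given weights, means, and variances (one entry per component).
    """
    tau = np.asarray(tau, dtype=float)[..., None]   # broadcast over components
    w, mu, v = (np.asarray(a, dtype=float) for a in (weights, means, variances))
    return np.sum(w * np.exp(-2.0 * np.pi**2 * tau**2 * v)
                    * np.cos(2.0 * np.pi * tau * mu), axis=-1)

# Two components: a slowly decaying trend plus a period-1 quasi-periodic pattern.
tau = np.linspace(0.0, 5.0, 200)
k = sm_kernel(tau, weights=[1.0, 0.5], means=[0.0, 1.0], variances=[0.01, 0.001])
```

A spectral mean μ_q away from zero contributes a periodic pattern with period 1/μ_q, while the variance v_q controls how quickly that pattern decays with distance.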

#### Copula Processes

Andrew Gordon Wilson, Zoubin Ghahramani, 2010. (In Advances in Neural Information Processing Systems 23). **Note**: Spotlight.

Abstract▼ URL

We define a copula process which describes the dependencies between arbitrarily many random variables independently of their marginal distributions. As an example, we develop a stochastic volatility model, Gaussian Copula Process Volatility (GCPV), to predict the latent standard deviations of a sequence of random variables. To make predictions we use Bayesian inference, with the Laplace approximation, and with Markov chain Monte Carlo as an alternative. We find that our model can outperform GARCH on simulated and financial data. Unlike GARCH, GCPV can also easily handle missing data, incorporate covariates other than time, and model a rich class of covariance structures.

**Comment:** Supplementary Material, slides.
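
To make the volatility construction concrete, here is a minimal generative sketch of the GCPV idea: a latent Gaussian process is warped through a positive function to give input-dependent standard deviations. The exponential warping and RBF kernel are our illustrative choices; the paper treats the warping function and inference (Laplace approximation, MCMC) with more care:

```python
import numpy as np

def rbf(x, lengthscale=1.0):
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * (d / lengthscale)**2)

rng = np.random.default_rng(0)
t = np.linspace(0.0, 10.0, 300)

# Latent GP driving the volatility.
f = rng.multivariate_normal(np.zeros(len(t)), rbf(t) + 1e-8 * np.eye(len(t)))

# Warp to positive standard deviations (exp is one possible warping).
sigma = np.exp(f)

# Observations whose latent volatility evolves smoothly with t.
y = rng.normal(0.0, sigma)
```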

#### Generalised Wishart Processes

Andrew Gordon Wilson, Zoubin Ghahramani, 2011. (In 27th Conference on Uncertainty in Artificial Intelligence).

Abstract▼ URL

We introduce a new stochastic process called the generalised Wishart process (GWP). It is a collection of positive semi-definite random matrices indexed by an arbitrary input variable. We use this process as a prior over dynamic (e.g. time varying) covariance matrices. The GWP captures a diverse class of covariance dynamics, naturally handles missing data, scales nicely with dimension, has easily interpretable parameters, and can use input variables that include covariates other than time. We describe how to construct the GWP, introduce general procedures for inference and prediction, and show that it outperforms its main competitor, multivariate GARCH, even on financial data that especially suits GARCH.

**Comment:** Supplementary Material, Best Student Paper Award
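
In our own notation, the construction behind the GWP can be sketched in one display. Given ν independent vectors of Gaussian process draws and a lower-triangular matrix L,

```latex
\Sigma(x) \;=\; \sum_{i=1}^{\nu} L\,\hat{u}_i(x)\,\hat{u}_i(x)^{\top} L^{\top},
\qquad
\hat{u}_i(x) = \bigl(u_{i1}(x), \dots, u_{ip}(x)\bigr)^{\top},
\quad u_{id} \sim \mathcal{GP}(0, k),
```

so that at each input x the matrix Σ(x) is marginally Wishart with ν degrees of freedom and scale matrix LL⊤, while the shared GP draws make the covariance matrices vary smoothly across inputs.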

#### GPatt: Fast Multidimensional Pattern Extrapolation with Gaussian Processes

Andrew Gordon Wilson, Elad Gilboa, Arye Nehorai, John P Cunningham, 2013. (arXiv preprint arXiv:1310.5288).

Abstract▼ URL

Gaussian processes are typically used for smoothing and interpolation on small datasets. We introduce a new Bayesian nonparametric framework – GPatt – enabling automatic pattern extrapolation with Gaussian processes on large multidimensional datasets. GPatt unifies and extends highly expressive kernels and fast exact inference techniques. Without human intervention – no hand crafting of kernel features, and no sophisticated initialisation procedures – we show that GPatt can solve large scale pattern extrapolation, inpainting, and kernel discovery problems, including a problem with 383,400 training points. We find that GPatt significantly outperforms popular alternative scalable Gaussian process methods in speed and accuracy. Moreover, we discover profound differences between each of these methods, suggesting expressive kernels, nonparametric representations, and scalable inference which exploits model structure are useful in combination for modelling large scale multidimensional patterns.
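
A toy sketch of the structure GPatt exploits: for a product kernel (such as an SMP kernel) evaluated on a Cartesian grid of inputs, the Gram matrix is a Kronecker product, so its eigendecomposition reduces to cheap per-dimension decompositions. The code below is our illustration of that algebra, not the GPatt implementation:

```python
import numpy as np

def rbf(x, lengthscale=1.0):
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * (d / lengthscale)**2)

# A product kernel on a 50 x 40 grid factorises as K = kron(K1, K2),
# so the full 2000 x 2000 Gram matrix never needs to be built.
x1 = np.linspace(0.0, 1.0, 50)
x2 = np.linspace(0.0, 1.0, 40)
K1, K2 = rbf(x1), rbf(x2)

# Eigendecompositions of the small factors give all 2000 eigenvalues of K
# at O(50^3 + 40^3) rather than O(2000^3) cost.
e1, Q1 = np.linalg.eigh(K1)
e2, Q2 = np.linalg.eigh(K2)
eigvals_K = np.kron(e1, e2)
```

The same identity extends to the eigenvectors (the eigenvector matrix of K is the Kronecker product of the per-dimension eigenvector matrices), which is what makes exact inference tractable at this scale.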

#### Gaussian Process Regression Networks

Andrew Gordon Wilson, David A Knowles, Zoubin Ghahramani, October 19 2011. Department of Engineering, University of Cambridge, Cambridge, UK.

Abstract▼ URL

We introduce a new regression framework, Gaussian process regression networks (GPRN), which combines the structural properties of Bayesian neural networks with the non-parametric flexibility of Gaussian processes. This model accommodates input dependent signal and noise correlations between multiple response variables, input dependent length-scales and amplitudes, and heavy-tailed predictive distributions. We derive both efficient Markov chain Monte Carlo and variational Bayes inference procedures for this model. We apply GPRN as a multiple output regression and multivariate volatility model, demonstrating substantially improved performance over eight popular multiple output (multi-task) Gaussian process models and three multivariate volatility models on benchmark datasets, including a 1000 dimensional gene expression dataset.

**Comment:** arXiv:1110.4411
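
A minimal generative sketch of the GPRN, assuming RBF kernels and omitting the node-level noise term for brevity (kernel choices, shapes, and names are ours):

```python
import numpy as np

def rbf(x, lengthscale=1.0):
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * (d / lengthscale)**2)

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 100)
N, P, Q = len(x), 3, 2                       # inputs, outputs, latent nodes

Kf = rbf(x, 0.1) + 1e-8 * np.eye(N)          # kernel for node functions
Kw = rbf(x, 0.5) + 1e-8 * np.eye(N)          # smoother kernel for the weights

# Latent node functions f(x) and input-dependent mixing weights W(x), all GPs.
f = rng.multivariate_normal(np.zeros(N), Kf, size=Q)        # (Q, N)
W = rng.multivariate_normal(np.zeros(N), Kw, size=(P, Q))   # (P, Q, N)

# y(x) = W(x) f(x) + noise: because the mixing matrix varies with x,
# the signal correlations between the P outputs are input dependent.
y = np.einsum('pqn,qn->pn', W, f) + 0.05 * rng.standard_normal((P, N))
```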

#### Dependent Indian buffet processes

Sinead Williamson, Peter Orbanz, Zoubin Ghahramani, May 2010. (In 13th International Conference on Artificial Intelligence and Statistics). Chia Laguna, Sardinia, Italy. W & CP.

Abstract▼ URL

Latent variable models represent hidden structure in observational data. To account for the distribution of the observational data changing over time, space or some other covariate, we need generalizations of latent variable models that explicitly capture this dependency on the covariate. A variety of such generalizations has been proposed for latent variable models based on the Dirichlet process. We address dependency on covariates in binary latent feature models by introducing a dependent Indian buffet process. The model generates a binary random matrix with an unbounded number of columns for each value of the covariate. Evolution of the binary matrices over the covariate set is controlled by a hierarchical Gaussian process model. The choice of covariance functions controls the dependence structure and exchangeability properties of the model. We derive a Markov chain Monte Carlo sampling algorithm for Bayesian inference, and provide experiments on both synthetic and real-world data. The experimental results show that explicit modeling of dependencies significantly improves the accuracy of predictions.
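
One simple way to realise a covariate-indexed binary matrix with IBP-like marginals, in the spirit of the model above (this thresholding construction is our illustration, not the paper's exact hierarchical model):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
N, K, T = 20, 10, 5              # objects, truncated features, covariate values
t = np.linspace(0.0, 1.0, T)

def rbf(x, lengthscale=0.5):
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * (d / lengthscale)**2) + 1e-8 * np.eye(len(x))

# Stick-breaking gives each feature a base inclusion probability, as in the IBP.
alpha = 2.0
pi = np.cumprod(rng.beta(alpha, 1.0, size=K))

# One GP per (object, feature) over the covariate; thresholding its Gaussian
# CDF against pi_k yields Bernoulli(pi_k) marginals at each covariate value,
# with the GP covariance controlling how feature membership evolves over t.
g = rng.multivariate_normal(np.zeros(T), rbf(t), size=(N, K))   # (N, K, T)
Z = (norm.cdf(g) < pi[None, :, None]).astype(int)               # (N, K, T)
```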

#### The IBP compound Dirichlet process and its application to focused topic modeling

Sinead Williamson, Katherine A. Heller, C. Wang, D. M. Blei, June 2010. (In 27th International Conference on Machine Learning). Haifa, Israel.

Abstract▼ URL

The hierarchical Dirichlet process (HDP) is a Bayesian nonparametric mixed membership model — each data point is modeled with a collection of components of different proportions. Though powerful, the HDP makes an assumption that the probability of a component being exhibited by a data point is positively correlated with its proportion within that data point. This might be an undesirable assumption. For example, in topic modeling, a topic (component) might be rare throughout the corpus but dominant within those documents (data points) where it occurs. We develop the IBP compound Dirichlet process (ICD), a Bayesian nonparametric prior that decouples across-data prevalence and within-data proportion in a mixed membership model. The ICD combines properties from the HDP and the Indian buffet process (IBP), a Bayesian nonparametric prior on binary matrices. The ICD assigns a subset of the shared mixture components to each data point. This subset, the data point’s “focus”, is determined independently from the amount that each of its components contribute. We develop an ICD mixture model for text, the focused topic model (FTM), and show superior performance over the HDP-based topic model.
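
A truncated generative sketch of the decoupling the ICD achieves, with hypothetical variable names and the word-level likelihood omitted:

```python
import numpy as np

rng = np.random.default_rng(3)
D, K = 5, 8                                   # documents, truncated topics

# IBP-style focus: which topics a document exhibits (across-data prevalence).
pi = np.cumprod(rng.beta(2.0, 1.0, size=K))   # stick-breaking topic probabilities
B = rng.random((D, K)) < pi                   # binary focus matrix

# Relative topic mass, decoupled from how often each topic is selected.
phi = rng.gamma(1.0, 1.0, size=K)

thetas = {}
for d in range(D):
    focus = np.flatnonzero(B[d])
    if focus.size == 0:
        continue
    # Within-document proportions restricted to the document's focus;
    # words would then be drawn from a standard topic-model likelihood.
    thetas[d] = rng.dirichlet(phi[focus])
```

A rare topic (small π_k) can still receive a large proportion φ_k inside the documents that focus on it, which is exactly the decoupling the abstract describes.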

#### Tree-based inference for Dirichlet process mixtures

Yang Xu, Katherine A. Heller, Zoubin Ghahramani, April 2009. (In 12th International Conference on Artificial Intelligence and Statistics). Edited by D. van Dyk, M. Welling. Clearwater Beach, FL, USA. Microtome Publishing (paper), Journal of Machine Learning Research (online). **Note**: ISSN 1938-7228.

Abstract▼ URL

The Dirichlet process mixture (DPM) is a widely used model for clustering and for general nonparametric Bayesian density estimation. Unfortunately, as in many statistical models, exact inference in a DPM is intractable, and approximate methods are needed to perform efficient inference. While most attention in the literature has been placed on Markov chain Monte Carlo (MCMC) [1, 2, 3], variational Bayesian (VB) [4] and collapsed variational methods [5], [6] recently introduced a novel class of approximations for DPMs based on Bayesian hierarchical clustering (BHC). These tree-based combinatorial approximations efficiently sum over exponentially many ways of partitioning the data and offer a novel lower bound on the marginal likelihood of the DPM [6]. In this paper we make the following contributions: (1) We show empirically that the BHC lower bounds are substantially tighter than the bounds given by VB [4] and by collapsed variational methods [5] on synthetic and real datasets. (2) We also show that BHC offers a more accurate predictive performance on these datasets. (3) We further improve the tree-based lower bounds with an algorithm that efficiently sums contributions from alternative trees. (4) We present a fast approximate method for BHC. Our results suggest that our combinatorial approximate inference methods and lower bounds may be useful not only in DPMs but in other models as well.
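
In our notation, the tree-based bound rests on the BHC recursion of [6]: for a subtree T_k merging subtrees T_i and T_j over data D_k = D_i ∪ D_j,

```latex
p(\mathcal{D}_k \mid T_k)
  = \pi_k\, p(\mathcal{D}_k \mid \mathcal{H}_1)
  + (1 - \pi_k)\, p(\mathcal{D}_i \mid T_i)\, p(\mathcal{D}_j \mid T_j),
\qquad
\pi_k = \frac{\alpha\,\Gamma(n_k)}{d_k},
\quad
d_k = \alpha\,\Gamma(n_k) + d_i d_j,
```

where H_1 is the hypothesis that all n_k points in D_k come from a single mixture component, α is the DPM concentration parameter, and d_i = α at the leaves. Efficiently summing contributions from alternative trees, as in contribution (3) above, tightens the resulting lower bound on the DPM marginal likelihood.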