# Clustering

Clustering algorithms are unsupervised methods for finding groups of similar points in data. They are closely related to statistical mixture models.

#### Testing a Bayesian Measure of Representativeness Using a Large Image Database

Joshua Abbott, Katherine A. Heller, Zoubin Ghahramani, Thomas L. Griffiths, 2011. (In Advances in Neural Information Processing Systems 24). Cambridge, MA, USA. The MIT Press.

Abstract▼

How do people determine which elements of a set are most representative of that set? We extend an existing Bayesian measure of representativeness, which indicates the representativeness of a sample from a distribution, to deﬁne a measure of the representativeness of an item to a set. We show that this measure is formally related to a machine learning method known as Bayesian Sets. Building on this connection, we derive an analytic expression for the representativeness of objects described by a sparse vector of binary features. We then apply this measure to a large database of images, using it to determine which images are the most representative members of different sets. Comparing the resulting predictions to human judgments of representativeness provides a test of this measure with naturalistic stimuli, and illustrates how databases that are more commonly used in computer vision and machine learning can be used to evaluate psychological theories.

#### Tree-Structured Stick Breaking for Hierarchical Data

R. P. Adams, Zoubin Ghahramani, Michael I. Jordan, 2010. (In Advances in Neural Information Processing Systems 23). The MIT Press.

Abstract▼ URL

Many data are naturally modeled by an unobserved hierarchical structure. In this paper we propose a flexible nonparametric prior over unknown data hierarchies. The approach uses nested stick-breaking processes to allow for trees of unbounded width and depth, where data can live at any node and are infinitely exchangeable. One can view our model as providing infinite mixtures where the components have a dependency structure corresponding to an evolutionary diffusion down a tree. By using a stick-breaking approach, we can apply Markov chain Monte Carlo methods based on slice sampling to perform Bayesian inference and simulate from the posterior distribution on trees. We apply our method to hierarchical clustering of images and topic modeling of text data.

#### Clustering Protein Sequence and Structure Space with Infinite Gaussian Mixture Models

A. Dubey, S. Hwang, C. Rangel, Carl Edward Rasmussen, Zoubin Ghahramani, David L. Wild, 2004. (In Pacific Symposium on Biocomputing 2004). (Pacific Symposium on Biocomputing 2004; Vol. 9). Singapore. The Big Island of Hawaii. World Scientific Publishing.

Abstract▼ URL

We describe a novel approach to the problem of automatically clustering protein sequences and discovering protein families, subfamilies etc., based on the thoery of infinite Gaussian mixture models. This method allows the data itself to dictate how many mixture components are required to model it, and provides a measure of the probability that two proteins belong to the same cluster. We illustrate our methods with application to three data sets: globin sequences, globin sequences with known tree-dimensional structures and G-pretein coupled receptor sequences. The consistency of the clusters indicate that that our methods is producing biologically meaningful results, which provide a very good indication of the underlying families and subfamilies. With the inclusion of secondary structure and residue solvent accessibility information, we obtain a classification of sequences of known structure which reflects and extends their SCOP classifications.

#### On a class of sigma-Stable Poisson-Kingman models and an effective marginalised sampler

Stefano Favaro, Maria Lomeli, Yee Whye Teh, 2015. (Statistics and Computing).

Abstract▼ URL

We investigate the use of a large class of discrete random probability measures, which is referred to as the class Q, , in the context of Bayesian nonparametric mixture modeling. The class Q encompasses both the the two-parameter Poisson?Dirichlet process and the normalized generalized Gamma process, thus allowing us to comparatively study the inferential advantages of these two well-known nonparametric priors. Apart from ahighly flexible parameterization, the distinguishing feature of the class Q is the availability of a tractable posterior distribution. This feature, in turn, leads to derive an efficient marginal MCMC algorithm for posterior sampling within the framework of mixture models. We demonstrate the efficacy of our modeling framework on both one-dimensional and multi-dimensional datasets.

#### Factorial Learning and the EM Algorithm

Zoubin Ghahramani, 1994. (In Advances in Neural Information Processing Systems 7). Edited by Gerald Tesauro, David S. Touretzky, Todd K. Leen. MIT Press.

Abstract▼ URL

Many real world learning problems are best characterized by an interaction of multiple independent causes or factors. Discovering such causal structure from the data is the focus of this paper. Based on Zemel and Hinton’s cooperative vector quantizer (CVQ) architecture, an unsupervised learning algorithm is derived from the Expectation–Maximization (EM) framework. Due to the combinatorial nature of the data generation process, the exact E-step is computationally intractable. Two alternative methods for computing the E-step are proposed: Gibbs sampling and mean-field approximation, and some promising empirical results are presented.

#### Bayesian Sets

Zoubin Ghahramani, Katherine A. Heller, December 2006. (In Advances in Neural Information Processing Systems 18). Edited by Y. Weiss, B. Schölkopf, J. Platt. Cambridge, MA, USA. The MIT Press.

Abstract▼ URL

Inspired by “Google™ Sets”, we consider the problem of retrieving items from a concept or cluster, given a query consisting of a few items from that cluster. We formulate this as a Bayesian inference problem and describe a very simple algorithm for solving it. Our algorithm uses a model-based concept of a cluster and ranks items using a score which evaluates the marginal probability that each item belongs to a cluster containing the query items. For exponential family models with conjugate priors this marginal probability is a simple function of sufficient statistics. We focus on sparse binary data and show that our score can be evaluated exactly using a single sparse matrix multiplication, making it possible to apply our algorithm to very large datasets. We evaluate our algorithm on three datasets: retrieving movies from EachMovie, finding completions of author sets from the NIPS dataset, and finding completions of sets of words appearing in the Grolier encyclopedia. We compare to Google™ Sets and show that Bayesian Sets gives very reasonable set completions.

#### Supervised learning from incomplete data via an EM approach

Zoubin Ghahramani, Michael I. Jordan, 1993. (In NIPS). Edited by Jack D. Cowan, Gerald Tesauro, Joshua Alspector. Morgan Kaufmann. **ISBN**: 1-55860-322-0.

Abstract▼ URL

Real-world learning tasks may involve high-dimensional data sets with arbitrary patterns of missing data. In this paper we present a framework based on maximum likelihood density estimation for learning from such data sets. We use mixture models for the density estimates and make two distinct appeals to the ExpectationMaximization (EM) principle (Dempster et al., 1977) in deriving a learning algorithm—EM is used both for the estimation of mixture components and for coping with missing data. The resulting algorithm is applicable to a wide range of supervised as well as unsupervised learning problems. Results from a classification benchmark—the iris data set—are presented.

#### Variational inference for nonparametric multiple clustering

Y. Guan, J. G. Dy, D. Niu, Z. Ghahramani, July 2010. (In KDD10 Workshop on Discovering, Summarizing, and Using Multiple Clusterings). Washington, DC, USA.

Abstract▼ URL

Most clustering algorithms produce a single clustering solution. Similarly, feature selection for clustering tries to find one feature subset where one interesting clustering solution resides. However, a single data set may be multi-faceted and can be grouped and interpreted in many different ways, especially for high dimensional data, where feature selection is typically needed. Moreover, different clustering solutions are interesting for different purposes. Instead of committing to one clustering solution, in this paper we introduce a probabilistic nonparametric Bayesian model that can discover several possible clustering solutions and the feature subset views that generated each cluster partitioning simultaneously. We provide a variational inference approach to learn the features and clustering partitions in each view. Our model allows us not only to learn the multiple clusterings and views but also allows us to automatically learn the number of views and the number of clusters in each view.

#### Beta diffusion trees

Creighton Heaukulani, David A. Knowles, Zoubin Ghahramani, June 2014. (In 31st International Conference on Machine Learning). Beijing, China.

Abstract▼ URL

We define the beta diffusion tree, a random tree structure with a set of leaves that defines a collection of overlapping subsets of objects, known as a feature allocation. The generative process for the tree is defined in terms of particles (representing the objects) diffusing in some continuous space, analogously to the Dirichlet and Pitman–Yor diffusion trees (Neal, 2003b; Knowles & Ghahramani, 2011), both of which define tree structures over clusters of the particles. With the beta diffusion tree, however, multiple copies of a particle may exist and diffuse to multiple locations in the continuous space, resulting in (a random number of) possibly overlapping clusters of the objects. We demonstrate how to build a hierarchically-clustered factor analysis model with the beta diffusion tree and how to perform inference over the random tree structures with a Markov chain Monte Carlo algorithm. We conclude with several numerical experiments on missing data problems with data sets of gene expression arrays, international development statistics, and intranational socioeconomic measurements.

#### Beta diffusion trees and hierarchical feature allocations

Creighton Heaukulani, David A. Knowles, Zoubin Ghahramani, August 2014. Dept. of Engineering, University of Cambridge,

Abstract▼ URL

We define the beta diffusion tree, a random tree structure with a set of leaves that defines a collection of overlapping subsets of objects, known as a feature allocation. A generative process for the tree structure is defined in terms of particles (representing the objects) diffusing in some continuous space, analogously to the Dirichlet diffusion tree (Neal, 2003b), which defines a tree structure over partitions (i.e., non-overlapping subsets) of the objects. Unlike in the Dirichlet diffusion tree, multiple copies of a particle may exist and diffuse along multiple branches in the beta diffusion tree, and an object may therefore belong to multiple subsets of particles. We demonstrate how to build a hierarchically-clustered factor analysis model with the beta diffusion tree and how to perform inference over the random tree structures with a Markov chain Monte Carlo algorithm. We conclude with several numerical experiments on missing data problems with data sets of gene expression microarrays, international development statistics, and intranational socioeconomic measurements.

#### The combinatorial structure of beta negative binomial processes

Creighton Heaukulani, Daniel M. Roy, March 2014. Dept. of Engineering, University of Cambridge,

Abstract▼ URL

We characterize the combinatorial structure of conditionally-i.i.d. sequences of negative binomial processes with a common beta process base measure. In Bayesian nonparametric applications, such processes have served as models for unknown multisets of a measurable space. Previous work has characterized random subsets arising from conditionally-i.i.d. sequences of Bernoulli processes with a common beta process base measure. In this case, the combinatorial structure is described by the Indian buffet process. Our results give a count analogue of the Indian buffet process, which we call a negative binomial Indian buffet process. As an intermediate step toward this goal, we provide constructions for the beta negative binomial process that avoid a representation of the underlying beta process base measure.

#### Bayesian hierarchical clustering

Katherine A. Heller, Zoubin Ghahramani, 2005. (In ICML). Edited by Luc De Raedt, Stefan Wrobel. Association for Computing Machinery. ACM International Conference Proceeding Series. **ISBN**: 1-59593-180-5.

Abstract▼ URL

We present a novel algorithm for agglomerative hierarchical clustering based on evaluating marginal likelihoods of a probabilistic model. This algorithm has several advantages over traditional distance-based agglomerative clustering algorithms. (1) It defines a probabilistic model of the data which can be used to compute the predictive distribution of a test point and the probability of it belonging to any of the existing clusters in the tree. (2) It uses a model-based criterion to decide on merging clusters rather than an ad-hoc distance metric. (3) Bayesian hypothesis testing is used to decide which merges are advantageous and to output the recommended depth of the tree. (4) The algorithm can be interpreted as a novel fast bottom-up approximate inference method for a Dirichlet process (i.e. countably infinite) mixture model (DPM). It provides a new lower bound on the marginal likelihood of a DPM by summing over exponentially many clusterings of the data in polynomial time. We describe procedures for learning the model hyperpa-rameters, computing the predictive distribution, and extensions to the algorithm. Experimental results on synthetic and real-world data sets demonstrate useful properties of the algorithm.

#### A Nonparametric Bayesian Approach to Modeling Overlapping Clusters

Katherine A. Heller, Zoubin Ghahramani, 2007. (In AISTATS). Edited by Marina Meila, Xiaotong Shen. JMLR.org. JMLR Proceedings.

Abstract▼ URL

Although clustering data into mutually exclusive partitions has been an extremely successful approach to unsupervised learning, there are many situations in which a richer model is needed to fully represent the data. This is the case in problems where data points actually simultaneously belong to multiple, overlapping clusters. For example a particular gene may have several functions, therefore belonging to several distinct clusters of genes, and a biologist may want to discover these through unsupervised modeling of gene expression data. We present a new nonparametric Bayesian method, the Infinite Overlapping Mixture Model (IOMM), for modeling overlapping clusters. The IOMM uses exponential family distributions to model each cluster and forms an overlapping mixture by taking products of such distributions, much like products of experts (Hinton, 2002). The IOMM allows an unbounded number of clusters, and assignments of points to (multiple) clusters is modeled using an Indian Buffet Process (IBP), (Griffiths and Ghahramani, 2006). The IOMM has the desirable properties of being able to focus in on overlapping regions while maintaining the ability to model a potentially infinite number of clusters which may overlap. We derive MCMC inference algorithms for the IOMM and show that these can be used to cluster movies into multiple genres.

#### Statistical models for partial membership

Katherine A. Heller, Sinead Williamson, Zoubin Ghahramani, July 2008. (In 25th International Conference on Machine Learning). Edited by Andrew McCallum, Sam Roweis. Helsinki, Finland. Omnipress.

Abstract▼ URL

We present a principled Bayesian framework for modeling partial memberships of data points to clusters. Unlike a standard mixture model which assumes that each data point belongs to one and only one mixture component, or cluster, a partial membership model allows data points to have fractional membership in multiple clusters. Algorithms which assign data points partial memberships to clusters can be useful for tasks such as clustering genes based on microarray data (Gasch & Eisen, 2002). Our Bayesian Partial Membership Model (BPM) uses exponential family distributions to model each cluster, and a product of these distibtutions, with weighted parameters, to model each datapoint. Here the weights correspond to the degree to which the datapoint belongs to each cluster. All parameters in the BPM are continuous, so we can use Hybrid Monte Carlo to perform inference and learning. We discuss relationships between the BPM and Latent Dirichlet Allocation, Mixed Membership models, Exponential Family PCA, and fuzzy clustering. Lastly, we show some experimental results and discuss nonparametric extensions to our model.

#### Robust Multi-Class Gaussian Process Classification

Daniel Hernández-Lobato, José Miguel Hernández-Lobato, Pierre Dupont, 2011. (In Advances in Neural Information Processing Systems 25).

Abstract▼ URL

Multi-class Gaussian Processs Classifiers (MGPCs) are often affected by overfitting problems when labeling errors occur far from the decision boundaries. To prevent this, we investigate a robust MGPC (RMGPC) which considers labeling errors independently of their distance to the decision boundaries. Expectation propagation is used for approximate inference. Experiments with several datasets in which noise is injected in the labels illustrate the benefits of RMGPC. This method performs better than other Gaussian process alternatives based on considering latent Gaussian noise or heavy-tailed processes. When no noise is injected in the labels, RMGPC still performs equal or better than the other methods. Finally, we show how RMGPC can be used for successfully indentifying data instances which are difficult to classify correctly in practice.

#### Warped Mixtures for Nonparametric Cluster Shapes

Tomoharu Iwata, David Duvenaud, Zoubin Ghahramani, July 2013. (In 29th Conference on Uncertainty in Artificial Intelligence). Bellevue, Washington.

Abstract▼ URL

A mixture of Gaussians fit to a single curved or heavy-tailed cluster will report that the data contains many clusters. To produce more appropriate clusterings, we introduce a model which warps a latent mixture of Gaussians to produce nonparametric cluster shapes. The possibly low-dimensional latent mixture model allows us to summarize the properties of the high-dimensional clusters (or density manifolds) describing the data. The number of manifolds, as well as the shape and dimension of each manifold is automatically inferred. We derive a simple inference scheme for this model which analytically integrates out both the mixture parameters and the warping function. We show that our model is effective for density estimation, performs better than infinite Gaussian mixture models at recovering the true number of clusters, and produces interpretable summaries of high-dimensional datasets.

#### Unsupervised Many-to-Many Object Matching for Relational Data

Tomoharu Iwata, James Robert Lloyd, Zoubin Ghahramani, 2015. (IEEE Transactions on Pattern Analysis and Machine Intelligence).

Abstract▼ URL

We propose a method for unsupervised many-to-many object matching from multiple networks, which is the task of finding correspondences between groups of nodes in different networks. For example, the proposed method can discover shared word groups from multi-lingual document-word networks without cross-language alignment information. We assume that multiple networks share groups, and each group has its own interaction pattern with other groups. Using infinite relational models with this assumption, objects in different networks are clustered into common groups depending on their interaction patterns, discovering a matching. The effectiveness of the proposed method is experimentally demonstrated by using synthetic and real relational data sets, which include applications to cross-domain recommendation without shared user/item identifiers and multi-lingual word clustering.

#### Pitman-Yor Diffusion Trees

David A. Knowles, Zoubin Ghahramani, 2011. (In 27th Conference on Uncertainty in Artificial Intelligence).

Abstract▼ URL

We introduce the Pitman Yor Diffusion Tree (PYDT) for hierarchical clustering, a generalization of the Dirichlet Diffusion Tree (Neal, 2001) which removes the restriction to binary branching structure. The generative process is described and shown to result in an exchangeable distribution over data points. We prove some theoretical properties of the model and then present two inference methods: a collapsed MCMC sampler which allows us to model uncertainty over tree structures, and a computationally efficient greedy Bayesian EM search algorithm. Both algorithms use message passing on the tree structure. The utility of the model and algorithms is demonstrated on synthetic and real world data, both continuous and binary.

**Comment:** web site

#### Non-conjugate Variational Message Passing for Multinomial and Binary Regression

David A. Knowles, Thomas P. Minka, 2011. (In Advances in Neural Information Processing Systems 25).

Abstract▼ URL

Variational Message Passing (VMP) is an algorithmic implementation of the Variational Bayes (VB) method which applies only in the special case of conjugate exponential family models. We propose an extension to VMP, which we refer to as Non-conjugate Variational Message Passing (NCVMP) which aims to alleviate this restriction while maintaining modularity, allowing choice in how expectations are calculated, and integrating into an existing message-passing framework: Infer.NET. We demonstrate NCVMP on logistic binary and multinomial regression. In the multinomial case we introduce a novel variational bound for the softmax factor which is tighter than other commonly used bounds whilst maintaining computational tractability.

**Comment:** web site supplementary

#### Generalised GPLVM with Stochastic Variational Inference

Vidhi Lalchand, Aditya Ravuri, Neil D. Lawrence, 28–30 Mar 2022. (In 25th International Conference on Artificial Intelligence and Statistics). PMLR. Proceedings of Machine Learning Research.

Abstract▼ URL

Gaussian process latent variable models (GPLVM) are a flexible and non-linear approach to dimensionality reduction, extending classical Gaussian processes to an unsupervised learning context. The Bayesian incarnation of the GPLVM uses a variational framework, where the posterior over latent variables is approximated by a well-behaved variational family, a factorised Gaussian yielding a tractable lower bound. However, the non-factorisability of the lower bound prevents truly scalable inference. In this work, we study the doubly stochastic formulation of the Bayesian GPLVM model amenable with minibatch training. We show how this framework is compatible with different latent variable formulations and perform experiments to compare a suite of models. Further, we demonstrate how we can train in the presence of massively missing data and obtain high-fidelity reconstructions. We demonstrate the model’s performance by benchmarking against the canonical sparse GPLVM for high dimensional data examples.

#### General Bayesian inference schemes in infinite mixture models

Maria Lomeli, 2017. University College London,Gatsby Unit, London, UK.

Abstract▼ URL

Bayesian statistical models allow us to formalise our knowledge about the world and reason about our uncertainty, but there is a need for better procedures to accurately encode its complexity. One way to do so is through compositional models, which are formed by combining blocks consisting of simpler models. One can increase the complexity of the compositional model by either stacking more blocks or by using a not-so-simple model as a building block. This thesis is an example of the latter. One first aim is to expand the choice of Bayesian nonparametric (BNP) blocks for constructing tractable compositional models. So far, most of the models that have a Bayesian nonparametric component use a Dirichlet Process or a Pitman-Yor process because of the availability of tractable and compact representations. This thesis shows how to overcome certain intractabilities in order to obtain analogous compact representations for the class of Poisson-Kingman priors which includes the Dirichlet and Pitman-Yor processes. A major impediment to the widespread use of Bayesian nonparametric building blocks is that inference is often costly, intractable or difficult to carry out. This is an active research area since dealing with the model’s infinite dimensional component forbids the direct use of standard simulation-based methods. The main contribution of this thesis is a variety of inference schemes that tackle this problem: Markov chain Monte Carlo and Sequential Monte Carlo methods, which are exact inference schemes since they target the true posterior. The contributions of this thesis, in a larger context, provide general purpose exact inference schemes in the flavour or probabilistic programming: the user is able to choose from a variety of models, focusing only on the modelling part. Indeed, if the wide enough class of Poisson-Kingman priors is used as one of our blocks, this objective is achieved.

#### A hybrid sampler for Poisson-Kingman mixture models

Maria Lomeli, Stefano Favaro, Yee Whye Teh, December 2015. (In Advances in Neural Information Processing Systems 28). Montreal, Canada.

Abstract▼ URL

This paper concerns the introduction of a new Markov Chain Monte Carlo scheme for posterior sampling in Bayesian nonparametric mixture models with priors that belong to the general Poisson-Kingman class. We present a novel and compact way of representing the infinite dimensional component of the model such that while explicitly representing this infinite component it has less memory and storage requirements than previous MCMC schemes. We describe comparative simulation results demonstrating the efficacy of the proposed MCMC algorithm against existing marginal and conditional MCMC samplers.

#### A marginal sampler for sigma-Stable Poisson-Kingman mixture models

Maria Lomeli, Stefano Favaro, Yee Whye Teh, 2017. (Journal of Computational and Graphical Statistics).

Abstract▼ URL

We investigate the class of sigma-stable Poisson-Kingman random probability measures (RPMs) in the context of Bayesian nonparametric mixture modeling. This is a large class of discrete RPMs, which encompasses most of the popular discrete RPMs used in Bayesian nonparametrics, such as the Dirichlet process, Pitman-Yor process, the normalized inverse Gaussian process, and the normalized generalized Gamma process. We show how certain sampling properties and marginal characterizations of sigma-stable Poisson-Kingman RPMs can be usefully exploited for devising a Markov chain Monte Carlo (MCMC) algorithm for performing posterior inference with a Bayesian nonparametric mixture model. Specifically, we introduce a novel and efficient MCMC sampling scheme in an augmented space that has a small number of auxiliary variables per iteration. We apply our sampling scheme to a density estimation and clustering tasks with unidimensional and multidimensional datasets, and compare it against competing MCMC sampling schemes. Supplementary materials for this article are available online.

#### A Nonparametric Bayesian Model for Multiple Clustering with Overlapping Feature Views

Donglin Niu, Jennifer G. Dy, Z. Ghahramani, 2012. (In 15th International Conference on Artificial Intelligence and Statistics).

Abstract▼ URL

Most clustering algorithms produce a single clustering solution. This is inadequate for many data sets that are multi-faceted and can be grouped and interpreted in many different ways. Moreover, for high-dimensional data, different features may be relevant or irrelevant to each clustering solution, suggesting the need for feature selection in clustering. Features relevant to one clustering interpretation may be different from the ones relevant for an alternative interpretation or view of the data. In this paper, we introduce a probabilistic nonparametric Bayesian model that can discover multiple clustering solutions from data and the feature subsets that are relevant for the clusters in each view. In our model, the features in different views may be shared and therefore the sets of relevant features are allowed to overlap. We model feature relevance to each view using an Indian Buffet Process and the cluster membership in each view using a Chinese Restaurant Process. We provide an inference approach to learn the latent parameters corresponding to this multiple partitioning problem. Our model not only learns the features and clusters in each view but also automatically learns the number of clusters, number of views and number of features in each view.

#### Modeling and Visualizing Uncertainty in Gene Expression Clusters Using Dirichlet Process Mixtures

Carl Edward Rasmussen, Bernhard J. de la Cruz, Zoubin Ghahramani, David L. Wild, 2009. (IEEE/ACM Transactions on Computational Biology and Bioinformatics). **DOI**: 10.1109/TCBB.2007.70269. **ISSN**: 1545-5963.

Abstract▼ URL

Although the use of clustering methods has rapidly become one of the standard computational approaches in the literature of microarray gene expression data, little attention has been paid to uncertainty in the results obtained. Dirichlet process mixture (DPM) models provide a nonparametric Bayesian alternative to the bootstrap approach to modeling uncertainty in gene expression clustering. Most previously published applications of Bayesian model-based clustering methods have been to short time series data. In this paper, we present a case study of the application of nonparametric Bayesian clustering methods to the clustering of high-dimensional nontime series gene expression data using full Gaussian covariances. We use the probability that two genes belong to the same cluster in a DPM model as a measure of the similarity of these gene expression profiles. Conversely, this probability can be used to define a dissimilarity measure, which, for the purposes of visualization, can be input to one of the standard linkage algorithms used for hierarchical clustering. Biologically plausible results are obtained from the Rosetta compendium of expression profiles which extend previously published cluster analyses of this data.

#### R/BHC: fast Bayesian hierarchical clustering for microarray data

R. Savage, K. A. Heller, Y. Xu, Zoubin Ghahramani, W. Truman, M. Grant, K. Denby, D. L. Wild, August 2009. (BMC Bioinformatics 2009). BioMed Central. **DOI**: 10.1186/1471-2105-10-242. **ISSN**: 1471-2105. **PubMed ID**: 19660130.

Abstract▼ URL

Background: Although the use of clustering methods has rapidly become one of the standard computational approaches in the literature of microarray gene expression data analysis, little attention has been paid to uncertainty in the results obtained. Results: We present an R/Bioconductor port of a fast novel algorithm for Bayesian agglomerative hierarchical clustering and demonstrate its use in clustering gene expression microarray data. The method performs bottom-up hierarchical clustering, using a Dirichlet Process (infinite mixture) to model uncertainty in the data and Bayesian model selection to decide at each step which clusters to merge. Conclusion: Biologically plausible results are presented from a well studied data set: expression profiles of *A. thaliana* subjected to a variety of biotic and abiotic stresses. Our method avoids several limitations of traditional methods, for example how many clusters there should be and how to choose a principled distance metric.

#### Determinantal Clustering Processes - A Nonparametric Bayesian Approach to Kernel Based Semi-Supervised Clustering

Amar Shah, Zoubin Ghahramani, 2013. (UAI).

Abstract▼ URL

Semi-supervised clustering is the task of clustering data points into clusters where only a fraction of the points are labelled. The true number of clusters in the data is often unknown and most models require this parameter as an input. Dirichlet process mixture models are appealing as they can infer the number of clusters from the data. However, these models do not deal with high dimensional data well and can encounter difficulties in inference. We present a novel nonparameteric Bayesian kernel based method to cluster data points without the need to prespecify the number of clusters or to model complicated densities from which data points are assumed to be generated from. The key insight is to use determinants of submatrices of a kernel matrix as a measure of how close together a set of points are. We explore some theoretical properties of the model and derive a natural Gibbs based algorithm with MCMC hyperparameter learning. The model is implemented on a variety of synthetic and real world data sets.

#### SMEM Algorithm for Mixture Models

Naonori Ueda, Ryohei Nakano, Zoubin Ghahramani, Geoffrey E. Hinton, 2000. (Neural Computation).

Abstract▼ URL

We present a split-and-merge expectation-maximization (SMEM) algorithm to overcome the local maxima problem in parameter estimation of finite mixture models. In the case of mixture models, local maxima often involve having too many components of a mixture model in one part of the space and too few in another, widely separated part of the space. To escape from such configurations, we repeatedly perform simultaneous split-and-merge operations using a new criterion for efficiently selecting the split-and-merge candidates. We apply the proposed algorithm to the training of gaussian mixtures and mixtures of factor analyzers using synthetic and real data and show the effectiveness of using the split-and-merge operations to improve the likelihood of both the training data and of held-out test data. We also show the practical usefulness of the proposed algorithm by applying it to image compression and pattern recognition problems.

#### Split and Merge EM Algorithm for Improving Gaussian Mixture Density Estimates

Naonori Ueda, Ryohei Nakano, Zoubin Ghahramani, Geoffrey E. Hinton, 2000. (VLSI Signal Processing).

Abstract▼ URL

We present a split and merge EM algorithm to overcome the local maximum problem in Gaussian mixture density estimation. Nonglobal maxims often involve having too many Gaussians in one part of the space and too few in another, widely separated part of the space. To escape from such configurations we repeatedly perform split and merge operations using a new criterion for efficiently selecting the split and merge candidates. Experimental results on synthetic and real data show the effectiveness of using the split and merge operations to improve the likelihood of both the training data and of held-out test data

#### SMEM Algorithm for Mixture Models

Naonori Ueda, Ryohei Nakano, Zoubin Ghahramani, Geoffrey E. Hinton, 1998. (In NIPS). Edited by Michael J. Kearns, Sara A. Solla, David A. Cohn. The MIT Press. **ISBN**: 0-262-11245-0.

Abstract▼ URL

We present a split-and-merge expectation-maximization (SMEM) algorithm to overcome the local maxima problem in parameter estimation of finite mixture models. In the case of mixture models, local maxima often involve having too many components of a mixture model in one part of the space and too few in another, widely separated part of the space. To escape from such configurations, we repeatedly perform simultaneous split-and-merge operations using a new criterion for efficiently selecting the split-and-merge candidates. We apply the proposed algorithm to the training of gaussian mixtures and mixtures of factor analyzers using synthetic and real data and show the effectiveness of using the split- and-merge operations to improve the likelihood of both the training data and of held-out test data. We also show the practical usefulness of the proposed algorithm by applying it to image compression and pattern recognition problems.

#### Sum-Product Autoencoding: Encoding and Decoding Representations using Sum-Product Networks

Antonio Vergari, Robert Peharz, Nicola Di Mauro, Alejandro Molina, Kristian Kersting, Floriana Esposito, February 2018. (In 32nd AAAI Conference on Artificial Intelligence). New Orleans, USA.

Abstract▼

Abstract Sum-Product Networks (SPNs) are a deep probabilistic architecture that up to now has been successfully employed for tractable inference. Here, we extend their scope towards unsupervised representation learning: we encode samples into continuous and categorical embeddings and show that they can also be decoded back into the original input space by leveraging MPE inference. We characterize when this Sum-Product Autoencoding (SPAE) leads to equivalent reconstructions and extend it towards dealing with missing embedding information. Our experimental results on several multilabel classification problems demonstrate that SPAE is competitive with state-of-the-art autoencoder architectures, even if the SPNs were never trained to reconstruct their inputs.

#### Active Learning for Constrained Dirichlet Process Mixture Models

Andreas Vlachos, Zoubin Ghahramani, Ted Briscoe, 2010. (In Proceedings of the 2010 Workshop on Geometrical Models of Natural Language Semantics). Uppsala, Sweden.

Abstract▼ URL

Recent work applied Dirichlet Process Mixture Models to the task of verb clustering, incorporating supervision in the form of must-links and cannot-links constraints between instances. In this work, we introduce an active learning approach for constraint selection employing uncertainty-based sampling. We achieve substantial improvements over random selection on two datasets.

#### Dirichlet process mixture models for verb clustering

A. Vlachos, Z. Ghahramani, A Korhonen, July 2008. (In ICML Workshop on Prior Knowledge for Text and Language Processing). Edited by Guillaume Bouchard, Hal Daumé III, Marc Dymetman, Yee Whye Teh. Helsinki, Finland.

Abstract▼ URL

In this work we apply Dirichlet Process Mixture Models to a learning task in natural language processing (NLP): lexical-semantic verb clustering. We assess the performance on a dataset based on Levin’s (1993) verb classes using the recently introduced V-measure metric. In, we present a method to add human supervision to the model in order to to influence the solution with respect to some prior knowledge. The quantitative evaluation performed highlights the benefits of the chosen method compared to previously used clustering approaches.

#### Unsupervised and constrained Dirichlet process mixture models for verb clustering

A. Vlachos, A Korhonen, Z. Ghahramani, March 2009. (In 4th Workshop on Statistical Machine Translation, EACL '09). Athens, Greece.

Abstract▼ URL

In this work, we apply Dirichlet Process Mixture Models (DPMMs) to a learning task in natural language processing (NLP): lexical-semantic verb clustering. We thoroughly evaluate a method of guiding DPMMs towards a particular clustering solution using pairwise constraints. The quantitative and qualitative evaluation performed highlights the benefits of both standard and constrained DPMMs compared to previously used approaches. In addition, it sheds light on the use of evaluation measures and their practical application.

#### Distributed Inference for Dirichlet Process Mixture Models

Hong Ge, Yutian Chen, Moquan Wan, Zoubin Ghahramani, 07–09 Jul 2015. (In Proceedings of the 32nd International Conference on Machine Learning). Edited by Francis Bach, David Blei. Lille, France. PMLR. Proceedings of Machine Learning Research.

Abstract▼ URL

Bayesian nonparametric mixture models based on the Dirichlet process (DP) have been widely used for solving problems like clustering, density estimation and topic modelling. These models make weak assumptions about the underlying process that generated the observed data. Thus, when more data are collected, the complexity of these models can change accordingly. These theoretical properties often lead to superior predictive performance when compared to traditional finite mixture models. However, despite the increasing amount of data available, the application of Bayesian nonparametric mixture models is so far limited to relatively small data sets. In this paper, we propose an efficient distributed inference algorithm for the DP and the HDP mixture model. The proposed method is based on a variant of the slice sampler for DPs. Since this sampler does not involve a pre-determined truncation, the stationary distribution of the sampling algorithm is unbiased. We provide both local thread-level and distributed machine-level parallel implementations and study the performance of this sampler through an extensive set of experiments on image and text data. When compared to existing inference algorithms, the proposed method exhibits state-of-the-art accuracy and strong scalability with up to 512 cores.