## Publications

#### Statistical models for partial membership

Katherine A. Heller, Sinead Williamson, Zoubin Ghahramani, July 2008. (In 25th International Conference on Machine Learning). Edited by Andrew McCallum, Sam Roweis. Helsinki, Finland. Omnipress.


We present a principled Bayesian framework for modeling partial memberships of data points to clusters. Unlike a standard mixture model, which assumes that each data point belongs to one and only one mixture component, or cluster, a partial membership model allows data points to have fractional membership in multiple clusters. Algorithms which assign data points partial memberships to clusters can be useful for tasks such as clustering genes based on microarray data (Gasch & Eisen, 2002). Our Bayesian Partial Membership Model (BPM) uses exponential family distributions to model each cluster, and a product of these distributions, with weighted parameters, to model each data point. Here the weights correspond to the degree to which the data point belongs to each cluster. All parameters in the BPM are continuous, so we can use Hybrid Monte Carlo to perform inference and learning. We discuss relationships between the BPM and Latent Dirichlet Allocation, Mixed Membership models, Exponential Family PCA, and fuzzy clustering. Lastly, we show some experimental results and discuss nonparametric extensions to our model.
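The "product of distributions with weighted parameters" described in this abstract can be written out as a short sketch. The notation is an assumption made here for illustration, not taken from the listing: $K$ clusters, natural parameters $\theta_k$, sufficient statistics $s(x)$, base-measure term $h(x)$, and membership weights $\pi_k$ that are non-negative and sum to one.

```latex
% Weighted product of cluster densities, with normalizing constant c:
p(x \mid \pi, \theta) = \frac{1}{c} \prod_{k=1}^{K} p_k(x \mid \theta_k)^{\pi_k},
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1 .

% If every cluster is in the same exponential family,
%   p_k(x \mid \theta_k) = \exp\{ s(x)^\top \theta_k + h(x) + g(\theta_k) \},
% the product stays in that family, with a weighted natural parameter:
p(x \mid \pi, \theta) \propto
\exp\Big\{ s(x)^\top \Big( \sum_{k=1}^{K} \pi_k \theta_k \Big) + h(x) \Big\} .
```

Under these assumptions, a data point with weight vector $\pi$ is drawn from a single exponential family distribution whose natural parameter is the $\pi$-weighted average of the cluster parameters. Setting one $\pi_k$ to 1 gives back ordinary membership in cluster $k$.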

#### Probabilistic Models for Data Combination in Recommender Systems

Sinead Williamson, Zoubin Ghahramani, 2008. (In Learning from Multiple Sources Workshop, NIPS Conference). Whistler, Canada.

#### Dependent Indian buffet processes

Sinead Williamson, Peter Orbanz, Zoubin Ghahramani, May 2010. (In 13th International Conference on Artificial Intelligence and Statistics). Chia Laguna, Sardinia, Italy. W & CP.


Latent variable models represent hidden structure in observational data. To account for the distribution of the observational data changing over time, space or some other covariate, we need generalizations of latent variable models that explicitly capture this dependency on the covariate. A variety of such generalizations has been proposed for latent variable models based on the Dirichlet process. We address dependency on covariates in binary latent feature models, by introducing a dependent Indian Buffet Process. The model generates a binary random matrix with an unbounded number of columns for each value of the covariate. Evolution of the binary matrices over the covariate set is controlled by a hierarchical Gaussian process model. The choice of covariance functions controls the dependence structure and exchangeability properties of the model. We derive a Markov Chain Monte Carlo sampling algorithm for Bayesian inference, and provide experiments on both synthetic and real-world data. The experimental results show that explicit modeling of dependencies significantly improves accuracy of predictions.

#### The IBP compound Dirichlet process and its application to focused topic modeling

Sinead Williamson, Katherine A. Heller, C. Wang, D. M. Blei, June 2010. (In 27th International Conference on Machine Learning). Haifa, Israel.


The hierarchical Dirichlet process (HDP) is a Bayesian nonparametric mixed membership model: each data point is modeled with a collection of components of different proportions. Though powerful, the HDP makes an assumption that the probability of a component being exhibited by a data point is positively correlated with its proportion within that data point. This might be an undesirable assumption. For example, in topic modeling, a topic (component) might be rare throughout the corpus but dominant within those documents (data points) where it occurs. We develop the IBP compound Dirichlet process (ICD), a Bayesian nonparametric prior that decouples across-data prevalence and within-data proportion in a mixed membership model. The ICD combines properties from the HDP and the Indian buffet process (IBP), a Bayesian nonparametric prior on binary matrices. The ICD assigns a subset of the shared mixture components to each data point. This subset, the data point’s “focus”, is determined independently from the amount that each of its components contributes. We develop an ICD mixture model for text, the focused topic model (FTM), and show superior performance over the HDP-based topic model.
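For readers unfamiliar with the IBP as a prior on binary matrices, the following is a minimal sketch of the standard sequential ("restaurant") construction of the plain IBP. It is not the ICD, the focused topic model, or the dependent IBP from the papers listed here. The function names `sample_poisson` and `sample_ibp`, the concentration parameter `alpha`, and the seed handling are assumptions made for this illustration.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw from Poisson(rate) with Knuth's method (adequate for small rates)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_ibp(num_rows, alpha, seed=0):
    """Sample a binary feature matrix from the standard IBP (hypothetical helper).

    Row i reuses existing feature k with probability m_k / i, where m_k is the
    number of earlier rows holding that feature, and then adds
    Poisson(alpha / i) brand-new features.
    """
    rng = random.Random(seed)
    column_counts = []  # m_k for every feature created so far
    rows = []
    for i in range(1, num_rows + 1):
        row = [1 if rng.random() < m / i else 0 for m in column_counts]
        new_features = sample_poisson(alpha / i, rng)
        row.extend([1] * new_features)
        # Update counts for old features, then register the new ones.
        column_counts = [m + z for m, z in zip(column_counts, row)]
        column_counts += [1] * new_features
        rows.append(row)
    # Pad earlier rows with zeros so the matrix is rectangular.
    width = len(column_counts)
    return [row + [0] * (width - len(row)) for row in rows]


if __name__ == "__main__":
    for row in sample_ibp(num_rows=5, alpha=2.0, seed=1):
        print(row)
```

The number of columns is not fixed in advance; it grows with the number of rows, which is the sense in which the prior is nonparametric. In the ICD as described in the abstract, a matrix of this kind would select which components each data point may use, while the within-data proportions are determined separately.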