## Publications

#### Large Scale Non-parametric Inference: Data Parallelisation in the Indian Buffet Process

Finale Doshi-Velez, David Knowles, Shakir Mohamed, Zoubin Ghahramani, December 2009. (In Advances in Neural Information Processing Systems 23). Cambridge, MA, USA. The MIT Press.

Nonparametric Bayesian models provide a framework for flexible probabilistic modelling of complex datasets. Unfortunately, the high-dimensional averages required for Bayesian methods can be slow, especially with the unbounded representations used by nonparametric models. We address the challenge of scaling Bayesian inference to the increasingly large datasets found in real-world applications. We focus on parallelisation of inference in the Indian Buffet Process (IBP), which allows data points to have an unbounded number of sparse latent features. Our novel MCMC sampler divides a large data set between multiple processors and uses message passing to compute the global likelihoods and posteriors. This algorithm, the first parallel inference scheme for IBP-based models, scales to datasets orders of magnitude larger than have previously been possible.

#### No Correlation Between Childhood Maltreatment and Telomere Length

Daniel Glass, Leopold Parts, David A. Knowles, Abraham Aviv, Tim D. Spector, 2010. (Biological Psychiatry).

Telomeres are lengths of repetitive DNA that cap the ends of chromosomes. They protect the ends of the chromosome and shorten with each cell division. Short leukocyte telomere length has been related to a number of age-related diseases. In addition, shorter telomere length has been associated with environmental factors such as smoking and lack of exercise. In a recent issue of Biological Psychiatry, Tyrka et al. (4) published a report suggesting a link between maltreatment in childhood and telomere shortening in 31 subjects. Individuals who had suffered maltreatment had telomere length 0.70 ± 0.24 compared with 1.02 ± 0.52 in individuals who had not been abused.

#### Beta diffusion trees

Creighton Heaukulani, David A. Knowles, Zoubin Ghahramani, June 2014. (In 31st International Conference on Machine Learning). Beijing, China.

We define the beta diffusion tree, a random tree structure with a set of leaves that defines a collection of overlapping subsets of objects, known as a feature allocation. The generative process for the tree is defined in terms of particles (representing the objects) diffusing in some continuous space, analogously to the Dirichlet and Pitman–Yor diffusion trees (Neal, 2003b; Knowles & Ghahramani, 2011), both of which define tree structures over clusters of the particles. With the beta diffusion tree, however, multiple copies of a particle may exist and diffuse to multiple locations in the continuous space, resulting in (a random number of) possibly overlapping clusters of the objects. We demonstrate how to build a hierarchically-clustered factor analysis model with the beta diffusion tree and how to perform inference over the random tree structures with a Markov chain Monte Carlo algorithm. We conclude with several numerical experiments on missing data problems with data sets of gene expression arrays, international development statistics, and intranational socioeconomic measurements.

#### Beta diffusion trees and hierarchical feature allocations

Creighton Heaukulani, David A. Knowles, Zoubin Ghahramani, August 2014. Dept. of Engineering, University of Cambridge.

We define the beta diffusion tree, a random tree structure with a set of leaves that defines a collection of overlapping subsets of objects, known as a feature allocation. A generative process for the tree structure is defined in terms of particles (representing the objects) diffusing in some continuous space, analogously to the Dirichlet diffusion tree (Neal, 2003b), which defines a tree structure over partitions (i.e., non-overlapping subsets) of the objects. Unlike in the Dirichlet diffusion tree, multiple copies of a particle may exist and diffuse along multiple branches in the beta diffusion tree, and an object may therefore belong to multiple subsets of particles. We demonstrate how to build a hierarchically-clustered factor analysis model with the beta diffusion tree and how to perform inference over the random tree structures with a Markov chain Monte Carlo algorithm. We conclude with several numerical experiments on missing data problems with data sets of gene expression microarrays, international development statistics, and intranational socioeconomic measurements.

#### Message Passing Algorithms for the Dirichlet Diffusion Tree

David A. Knowles, Jurgen Van Gael, Zoubin Ghahramani, 2011. (In 28th International Conference on Machine Learning).

We demonstrate efficient approximate inference for the Dirichlet Diffusion Tree (Neal, 2003), a Bayesian nonparametric prior over tree structures. Although DDTs provide a powerful and elegant approach for modeling hierarchies, they haven’t seen much use to date. One problem is the computational cost of MCMC inference. We provide the first deterministic approximate inference methods for DDT models and show excellent performance compared to the MCMC alternative. We present message passing algorithms to approximate the Bayesian model evidence for a specific tree. This is used to drive sequential tree building and greedy search to find optimal tree structures, corresponding to hierarchical clusterings of the data. We demonstrate appropriate observation models for continuous and binary data. The empirical performance of our method is very close to the computationally expensive MCMC alternative on a density estimation problem, and significantly outperforms kernel density estimators.

#### Infinite Sparse Factor Analysis and Infinite Independent Components Analysis

David Knowles, Zoubin Ghahramani, September 2007. (In 7th International Conference on Independent Component Analysis and Signal Separation). London, UK. Springer. **DOI**: 10.1007/978-3-540-74494-8_48.

A nonparametric Bayesian extension of Independent Components Analysis (ICA) is proposed where observed data Y is modelled as a linear superposition, G, of a potentially infinite number of hidden sources, X. Whether a given source is active for a specific data point is specified by an infinite binary matrix, Z. The resulting sparse representation allows increased data reduction compared to standard ICA. We define a prior on Z using the Indian Buffet Process (IBP). We describe four variants of the model, with Gaussian or Laplacian priors on X and the one or two-parameter IBPs. We demonstrate Bayesian inference under these models using a Markov chain Monte Carlo (MCMC) algorithm on synthetic and gene expression data and compare to standard ICA algorithms.
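
The generative structure described in this abstract can be illustrated with a minimal sketch. It assumes the standard one-parameter IBP "restaurant" construction for Z and reads the abstract as Y = G(Z ⊙ X) + noise with Gaussian sources. The function names, the Gaussian mixing matrix, and the `noise_std` parameter are illustrative assumptions, not taken from the paper, and no inference is performed here.

```python
import numpy as np

def sample_ibp(num_points, alpha, rng):
    """Draw a binary matrix (points x features) from the one-parameter IBP."""
    counts = []  # counts[k] = number of earlier points that use feature k
    rows = []    # rows[i] = 0/1 indicators chosen by point i
    for i in range(1, num_points + 1):
        # Reuse each existing feature k with probability counts[k] / i.
        row = [int(rng.random() < m / i) for m in counts]
        # Then introduce Poisson(alpha / i) new features.
        num_new = rng.poisson(alpha / i)
        row.extend([1] * num_new)
        counts = [m + z for m, z in zip(counts, row)] + [1] * num_new
        rows.append(row)
    Z = np.zeros((num_points, len(counts)), dtype=int)
    for i, row in enumerate(rows):
        Z[i, :len(row)] = row
    return Z

def generate_data(num_points, num_dims, alpha, noise_std, rng):
    """Sample Y = G (Z * X) + noise; shapes and priors are assumptions."""
    Z = sample_ibp(num_points, alpha, rng).T        # (sources, points)
    num_sources = Z.shape[0]
    X = rng.standard_normal((num_sources, num_points))  # Gaussian sources
    G = rng.standard_normal((num_dims, num_sources))    # mixing matrix
    noise = noise_std * rng.standard_normal((num_dims, num_points))
    return G @ (Z * X) + noise, G, X, Z

rng = np.random.default_rng(0)
Y, G, X, Z = generate_data(num_points=20, num_dims=5, alpha=2.0,
                           noise_std=0.1, rng=rng)
```

The number of sources is not fixed in advance: it is whatever the IBP draw produces, which is the sense in which the representation is "potentially infinite".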

#### Pitman-Yor Diffusion Trees

David A. Knowles, Zoubin Ghahramani, 2011. (In 27th Conference on Uncertainty in Artificial Intelligence).

We introduce the Pitman-Yor Diffusion Tree (PYDT) for hierarchical clustering, a generalization of the Dirichlet Diffusion Tree (Neal, 2001) which removes the restriction to binary branching structure. The generative process is described and shown to result in an exchangeable distribution over data points. We prove some theoretical properties of the model and then present two inference methods: a collapsed MCMC sampler which allows us to model uncertainty over tree structures, and a computationally efficient greedy Bayesian EM search algorithm. Both algorithms use message passing on the tree structure. The utility of the model and algorithms is demonstrated on synthetic and real-world data, both continuous and binary.

#### Nonparametric Bayesian Sparse Factor Models with application to Gene Expression modelling

David A. Knowles, Zoubin Ghahramani, 2011. (Annals of Applied Statistics).

A nonparametric Bayesian extension of Factor Analysis (FA) is proposed where observed data Y is modeled as a linear superposition, G, of a potentially infinite number of hidden factors, X. The Indian Buffet Process (IBP) is used as a prior on G to incorporate sparsity and to allow the number of latent features to be inferred. The model’s utility for modeling gene expression data is investigated using randomly generated data sets based on a known sparse connectivity matrix for E. coli, and on three biological data sets of increasing complexity.

#### Statistical tools for ultra-deep pyrosequencing of fast evolving viruses

David A. Knowles, Susan Holmes, 2009. (In NIPS Workshop: Computational Biology).

We aim to detect minor variant Hepatitis B viruses (HBV) in 38 pyrosequencing samples from infected individuals. Errors involved in the amplification and ultra deep pyrosequencing (UDPS) of these samples are characterised using HBV plasmid controls. Homopolymeric regions and quality scores are found to be significant covariates in determining insertion and deletion (indel) error rates, but not mismatch rates which depend on the nucleotide transition matrix. This knowledge is used to derive two methods for classifying genuine mutations: a hypothesis testing framework and a mixture model. Using an approximate “ground truth” from a limiting dilution Sanger sequencing run, these methods are shown to outperform the naive percentage threshold approach. The possibility of early stage PCR errors becoming significant is investigated by simulation, which underlines the importance of the initial copy number.

#### Non-conjugate Variational Message Passing for Multinomial and Binary Regression

David A. Knowles, Thomas P. Minka, 2011. (In Advances in Neural Information Processing Systems 25).

Variational Message Passing (VMP) is an algorithmic implementation of the Variational Bayes (VB) method which applies only in the special case of conjugate exponential family models. We propose an extension to VMP, which we refer to as Non-conjugate Variational Message Passing (NCVMP), which aims to alleviate this restriction while maintaining modularity, allowing choice in how expectations are calculated, and integrating into an existing message-passing framework: Infer.NET. We demonstrate NCVMP on logistic binary and multinomial regression. In the multinomial case we introduce a novel variational bound for the softmax factor which is tighter than other commonly used bounds whilst maintaining computational tractability.

#### Modeling skin and ageing phenotypes using latent variable models in Infer.NET

David A. Knowles, Leopold Parts, Daniel Glass, John M. Winn, 2010. (In NIPS Workshop: Predictive Models in Personalized Medicine Workshop).

We demonstrate and compare three unsupervised Bayesian latent variable models implemented in Infer.NET for biomedical data modeling of 42 skin and ageing phenotypes measured on the 12,000 female twins in the Twins UK study. We address various data modeling problems including high missingness, heterogeneous data, and repeat observations. We compare the proposed models in terms of their performance at predicting disease labels and symptoms from available explanatory variables, concluding that factor analysis type models have the strongest statistical performance in this setting. We show that such models can be combined with regression components for improved interpretability.

#### Distinct epigenomic features in human cardiomyopathy

Mehregan Movassagh, Mun-Kit Choy, David A. Knowles, Lina Cordeddu, Syed Haider, Thomas Down, Lee Siggens, Ana Vujic, Ilenia Simeoni, Chris Penkett, Martin Goddard, Pietro Lio, Martin Bennett, Roger Foo, 2011. (Circulation, American Heart Association).

Background. The epigenome refers to marks on the genome including DNA methylation and histone modifications that regulate the expression of underlying genes. A consistent profile of gene expression changes in end-stage cardiomyopathy led us to hypothesise that distinct global patterns of the epigenome may also exist. Methods and Results. We constructed genome-wide maps of DNA methylation and Histone-3 Lysine-36 tri-methylation (H3K36me3)-enrichment for cardiomyopathic and normal human hearts. 506Mb of sequence per library was generated by high-throughput sequencing, covering 24 million out of the 28 million CG di-nucleotides in the human genome. DNA methylation was significantly different in promoter CpG-islands (CGI), intra-genic CGI, gene bodies and H3K36me3-enriched regions of the genome. Moreover DNA methylation differences were present in promoters of upregulated genes but not down-regulated genes. The profile of H3K36me3-enrichment itself was also significantly different in protein-coding regions of the genome. Conclusions. Distinct epigenomic patterns exist in important DNA elements of the human cardiac genome in end-stage cardiomyopathy. If epigenomic patterns track with disease progression, assays for the epigenome may be more useful than quantification of mRNA for assessing prognosis in heart failure. These results open up an important new horizon of research and further studies will be needed to determine how epigenomics contribute to altered gene expression in cardiomyopathy.

#### An Infinite Latent Attribute Model for Network Data

Konstantina Palla, David A. Knowles, Zoubin Ghahramani, June 2012. (In 29th International Conference on Machine Learning). Edinburgh, Scotland.

Latent variable models for network data extract a summary of the relational structure underlying an observed network. The simplest possible models subdivide nodes of the network into clusters; the probability of a link between any two nodes then depends only on their cluster assignment. Currently available models can be classified by whether clusters are disjoint or are allowed to overlap. These models can explain a “flat” clustering structure. Hierarchical Bayesian models provide a natural approach to capture more complex dependencies. We propose a model in which objects are characterised by a latent feature vector. Each feature is itself partitioned into disjoint groups (subclusters), corresponding to a second layer of hierarchy. In experimental comparisons, the model achieves significantly improved predictive performance on social and biological link prediction tasks. The results indicate that models with a single layer hierarchy over-simplify real networks.

#### A reversible infinite HMM using normalised random measures

Konstantina Palla, David A. Knowles, Zoubin Ghahramani, June 2014. (In 31st International Conference on Machine Learning). Beijing, China.

We present a nonparametric prior over reversible Markov chains. We use completely random measures, specifically gamma processes, to construct a countably infinite graph with weighted edges. By enforcing symmetry to make the edges undirected we define a prior over random walks on graphs that results in a reversible Markov chain. The resulting prior over infinite transition matrices is closely related to the hierarchical Dirichlet process but enforces reversibility. A reinforcement scheme has recently been proposed with similar properties, but the de Finetti measure is not well characterised. We take the alternative approach of explicitly constructing the mixing measure, which allows more straightforward and efficient inference at the cost of no longer having a closed form predictive distribution. We use our process to construct a reversible infinite HMM which we apply to two real datasets, one from epigenomics and one ion channel recording.
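
The link between symmetric edge weights and reversibility mentioned in this abstract can be checked numerically on a finite graph. The sketch below is a truncated illustration only: it draws independent gamma weights on a small, fully connected graph rather than using the gamma process construction of the paper, and the function name and `shape` parameter are assumptions.

```python
import numpy as np

def reversible_chain(num_states, shape, rng):
    """Random walk on an undirected graph with gamma-distributed weights."""
    raw = rng.gamma(shape, 1.0, size=(num_states, num_states))
    # Symmetrise so the edges are undirected: W[i, j] == W[j, i].
    W = np.triu(raw) + np.triu(raw, k=1).T
    degree = W.sum(axis=1)
    P = W / degree[:, None]      # transition matrix, rows sum to one
    pi = degree / degree.sum()   # stationary distribution of the walk
    return P, pi

rng = np.random.default_rng(0)
P, pi = reversible_chain(num_states=6, shape=1.0, rng=rng)

# Detailed balance: pi[i] * P[i, j] == pi[j] * P[j, i] for all i, j.
flow = pi[:, None] * P
assert np.allclose(flow, flow.T)
```

Detailed balance holds because pi[i] * P[i, j] equals W[i, j] divided by the total weight, which is symmetric in i and j whenever W is.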

#### A nonparametric variable clustering model

Konstantina Palla, David A. Knowles, Zoubin Ghahramani, December 2012. (In Advances in Neural Information Processing Systems 26). Lake Tahoe, California, USA.

Factor analysis models effectively summarise the covariance structure of high dimensional data, but the solutions are typically hard to interpret. This motivates attempting to find a disjoint partition, i.e. a simple clustering, of observed variables into highly correlated subsets. We introduce a Bayesian non-parametric approach to this problem, and demonstrate advantages over heuristic methods proposed to date. Our Dirichlet process variable clustering (DPVC) model can discover block-diagonal covariance structures in data. We evaluate our method on both synthetic and gene expression analysis problems.
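
To make the connection between a partition of variables and block-diagonal covariance concrete, here is a minimal sketch. It assumes a Chinese restaurant process prior over variable clusters and one independent unit-variance latent factor per cluster. These modelling details, along with the function names and parameters, are illustrative assumptions rather than the paper's exact specification.

```python
import numpy as np

def sample_crp(num_items, concentration, rng):
    """Assign items to clusters with the Chinese restaurant process."""
    assignments, sizes = [], []
    for _ in range(num_items):
        # Join an existing cluster in proportion to its size, or open a
        # new one in proportion to the concentration parameter.
        weights = np.array(sizes + [concentration], dtype=float)
        k = int(rng.choice(len(weights), p=weights / weights.sum()))
        if k == len(sizes):
            sizes.append(1)
        else:
            sizes[k] += 1
        assignments.append(k)
    return np.array(assignments)

def implied_covariance(assignments, loadings, noise_var):
    """Covariance when variable d loads only on its own cluster's factor."""
    same_cluster = assignments[:, None] == assignments[None, :]
    cov = np.outer(loadings, loadings) * same_cluster
    return cov + noise_var * np.eye(len(assignments))

rng = np.random.default_rng(1)
clusters = sample_crp(num_items=12, concentration=1.0, rng=rng)
loadings = rng.standard_normal(12)
cov = implied_covariance(clusters, loadings, noise_var=0.1)

# Sorting variables by cluster makes the block-diagonal structure visible.
order = np.argsort(clusters, kind="stable")
cov_sorted = cov[np.ix_(order, order)]
```

Variables in different clusters have exactly zero covariance under these assumptions, so `cov_sorted` has one dense block per cluster.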

#### The Supervised IBP: Neighbourhood Preserving Infinite Latent Feature Models

Novi Quadrianto, Viktoriia Sharmanska, David A. Knowles, Zoubin Ghahramani, July 2013. (In 29th Conference on Uncertainty in Artificial Intelligence). Bellevue, USA.

We propose a probabilistic model to infer supervised latent variables in the Hamming space from observed data. Our model allows simultaneous inference of the number of binary latent variables, and their values. The latent variables preserve neighbourhood structure of the data in the sense that objects in the same semantic concept have similar latent values, and objects in different concepts have dissimilar latent values. We formulate the supervised infinite latent variable problem based on an intuitive principle of pulling objects together if they are of the same type, and pushing them apart if they are not. We then combine this principle with a flexible Indian Buffet Process prior on the latent variables. We show that the inferred supervised latent variables can be directly used to perform a nearest neighbour search for the purpose of retrieval. We introduce a new application of dynamically extending hash codes, and show how to effectively couple the structure of the hash codes with the continuously growing structure of the neighbourhood preserving infinite latent feature space.

#### Dichotomous cellular properties of mouse orexin/hypocretin neurons

Cornelia Schone, Anne Venner, David A. Knowles, Mahesh M Karnani, Denis Burdakov, 2011. (The Journal of Physiology).

Hypothalamic hypocretin/orexin (hcrt/orx) neurons recently emerged as critical regulators of sleep-wake cycles, reward-seeking, and body energy balance. However, at the level of cellular and network properties, it remains unclear whether hcrt/orx neurons are one homogenous population, or whether there are several distinct types of hcrt/orx cells. Here, we collated diverse structural and functional information about individual hcrt/orx neurons in mouse brain slices, by combining patch-clamp analysis of spike firing, membrane currents, and synaptic inputs with confocal imaging of cell shape and subsequent 3-dimensional Sholl analysis of dendritic architecture. Statistical cluster analysis of intrinsic firing properties revealed that hcrt/orx neurons fall into two distinct types. These two cell types also differ in the complexity of their dendritic arbour, the strength of AMPA and GABAA receptor-mediated synaptic drive that they receive, and the density of low-threshold, 4-aminopyridine-sensitive, transient K+ current. Our results provide quantitative evidence that, at the cellular level, the mouse hcrt/orx system is composed of two classes of neurons with different firing properties, morphologies, and synaptic input organization.

#### Gaussian Process Regression Networks

Andrew Gordon Wilson, David A. Knowles, Zoubin Ghahramani, October 19, 2011. Department of Engineering, University of Cambridge, Cambridge, UK.

We introduce a new regression framework, Gaussian process regression networks (GPRN), which combines the structural properties of Bayesian neural networks with the non-parametric flexibility of Gaussian processes. This model accommodates input-dependent signal and noise correlations between multiple response variables, input-dependent length-scales and amplitudes, and heavy-tailed predictive distributions. We derive both efficient Markov chain Monte Carlo and variational Bayes inference procedures for this model. We apply GPRN as a multiple output regression and multivariate volatility model, demonstrating substantially improved performance over eight popular multiple output (multi-task) Gaussian process models and three multivariate volatility models on benchmark datasets, including a 1000-dimensional gene expression dataset.

**Comment:** arXiv:1110.4411

#### Gaussian Process Regression Networks

Andrew Gordon Wilson, David A. Knowles, Zoubin Ghahramani, June 2012. (In 29th International Conference on Machine Learning). Edinburgh, Scotland.

We introduce a new regression framework, Gaussian process regression networks (GPRN), which combines the structural properties of Bayesian neural networks with the nonparametric flexibility of Gaussian processes. GPRN accommodates input (predictor) dependent signal and noise correlations between multiple output (response) variables, input-dependent length-scales and amplitudes, and heavy-tailed predictive distributions. We derive both elliptical slice sampling and variational Bayes inference procedures for GPRN. We apply GPRN as a multiple output regression and multivariate volatility model, demonstrating substantially improved performance over eight popular multiple output (multi-task) Gaussian process models and three multivariate volatility models on real datasets, including a 1000-dimensional gene expression dataset.