University of Cambridge Home Department of Engineering
Computational and Biological Learning Lab

David A. Knowles

Publications

Andrew Gordon Wilson, David A Knowles, and Zoubin Ghahramani. Gaussian process regression networks. Technical Report arXiv:1110.4411 [stat.ML], Department of Engineering, University of Cambridge, Cambridge, UK, October 19 2011.

Abstract: We introduce a new regression framework, Gaussian process regression networks (GPRN), which combines the structural properties of Bayesian neural networks with the non-parametric flexibility of Gaussian processes. This model accommodates input dependent signal and noise correlations between multiple response variables, input dependent length-scales and amplitudes, and heavy-tailed predictive distributions. We derive both efficient Markov chain Monte Carlo and variational Bayes inference procedures for this model. We apply GPRN as a multiple output regression and multivariate volatility model, demonstrating substantially improved performance over eight popular multiple output (multi-task) Gaussian process models and three multivariate volatility models on benchmark datasets, including a 1000 dimensional gene expression dataset.

Comment: arXiv:1110.4411

David A. Knowles and Zoubin Ghahramani. Nonparametric Bayesian sparse factor models with application to gene expression modelling. Annals of Applied Statistics, 5(2B):1534-1552, 2011.

Abstract: A nonparametric Bayesian extension of Factor Analysis (FA) is proposed where observed data Y is modeled as a linear superposition, G, of a potentially infinite number of hidden factors, X. The Indian Buffet Process (IBP) is used as a prior on G to incorporate sparsity and to allow the number of latent features to be inferred. The model's utility for modeling gene expression data is investigated using randomly generated data sets based on a known sparse connectivity matrix for E. coli, and on three biological data sets of increasing complexity.

David A. Knowles and Zoubin Ghahramani. Pitman-Yor diffusion trees. In 27th Conference on Uncertainty in Artificial Intelligence, 2011.

Abstract: We introduce the Pitman Yor Diffusion Tree (PYDT) for hierarchical clustering, a generalization of the Dirichlet Diffusion Tree (Neal, 2001) which removes the restriction to binary branching structure. The generative process is described and shown to result in an exchangeable distribution over data points. We prove some theoretical properties of the model and then present two inference methods: a collapsed MCMC sampler which allows us to model uncertainty over tree structures, and a computationally efficient greedy Bayesian EM search algorithm. Both algorithms use message passing on the tree structure. The utility of the model and algorithms is demonstrated on synthetic and real world data, both continuous and binary.

Comment: web site

David A. Knowles, Jurgen Van Gael, and Zoubin Ghahramani. Message passing algorithms for the Dirichlet diffusion tree. In 28th International Conference on Machine Learning, 2011.

Abstract: We demonstrate efficient approximate inference for the Dirichlet Diffusion Tree (Neal, 2003), a Bayesian nonparametric prior over tree structures. Although DDTs provide a powerful and elegant approach for modeling hierarchies they haven't seen much use to date. One problem is the computational cost of MCMC inference. We provide the first deterministic approximate inference methods for DDT models and show excellent performance compared to the MCMC alternative. We present message passing algorithms to approximate the Bayesian model evidence for a specific tree. This is used to drive sequential tree building and greedy search to find optimal tree structures, corresponding to hierarchical clusterings of the data. We demonstrate appropriate observation models for continuous and binary data. The empirical performance of our method is very close to the computationally expensive MCMC alternative on a density estimation problem, and significantly outperforms kernel density estimators.

Comment: web site

David A. Knowles and Thomas P. Minka. Non-conjugate variational message passing for multinomial and binary regression. In Advances in Neural Information Processing Systems 24, 2011.

Abstract: Variational Message Passing (VMP) is an algorithmic implementation of the Variational Bayes (VB) method which applies only in the special case of conjugate exponential family models. We propose an extension to VMP, which we refer to as Non-conjugate Variational Message Passing (NCVMP) which aims to alleviate this restriction while maintaining modularity, allowing choice in how expectations are calculated, and integrating into an existing message-passing framework: Infer.NET. We demonstrate NCVMP on logistic binary and multinomial regression. In the multinomial case we introduce a novel variational bound for the softmax factor which is tighter than other commonly used bounds whilst maintaining computational tractability.

Comment: web site supplementary

David A. Knowles, Leopold Parts, Daniel Glass, and John M. Winn. Inferring a measure of physiological age from multiple ageing related phenotypes. In NIPS Workshop: From Statistical Genetics to Predictive Models in Personalized Medicine, 2011.

Abstract: What is ageing? One definition is simultaneous degradation of multiple organ systems. Can an individual be said to be "old" or "young" for their (chronological) age in a scientifically meaningful way? We investigate these questions using ageing related phenotypes measured on the 12,000 female twins in the Twins UK study. We propose a simple linear model of ageing, which allows a latent adjustment to be made to an individual's chronological age to give her "physiological age", shared across the observed phenotypes. We note problems with the analysis resulting from the linearity assumption and show how to alleviate these issues using a non-linear extension. We find that more gene expression probes are significantly associated with our measurement of physiological age than with chronological age.

Comment: web site

Mehregan Movassagh, Mun-Kit Choy, David A. Knowles, Lina Cordeddu, Syed Haider, Thomas Down, Lee Siggens, Ana Vujic, Ilenia Simeoni, Chris Penkett, Martin Goddard, Pietro Lio, Martin Bennett, and Roger Foo. Distinct epigenomic features in human cardiomyopathy. Circulation, American Heart Association, 2011.

Abstract: Background. The epigenome refers to marks on the genome including DNA methylation and histone modifications that regulate the expression of underlying genes. A consistent profile of gene expression changes in end-stage cardiomyopathy led us to hypothesise that distinct global patterns of the epigenome may also exist. Methods and Results. We constructed genome-wide maps of DNA methylation and Histone-3 Lysine-36 tri-methylation (H3K36me3)-enrichment for cardiomyopathic and normal human hearts. 506Mb of sequence per library was generated by high-throughput sequencing, covering 24 million out of the 28 million CG di-nucleotides in the human genome. DNA methylation was significantly different in promoter CpG-islands (CGI), intra-genic CGI, gene bodies and H3K36me3-enriched regions of the genome. Moreover DNA methylation differences were present in promoters of upregulated genes but not down-regulated genes. The profile of H3K36me3-enrichment itself was also significantly different in protein-coding regions of the genome. Conclusions. Distinct epigenomic patterns exist in important DNA elements of the human cardiac genome in end-stage cardiomyopathy. If epigenomic patterns track with disease progression, assays for the epigenome may be more useful than quantification of mRNA for assessing prognosis in heart failure. These results open up an important new horizon of research and further studies will be needed to determine how epigenomics contribute to altered gene expression in cardiomyopathy.

Cornelia Schone, Anne Venner, David A. Knowles, Mahesh M Karnani, and Denis Burdakov. Dichotomous cellular properties of mouse orexin/hypocretin neurons. The Journal of Physiology, 2011.

Abstract: Hypothalamic hypocretin/orexin (hcrt/orx) neurons recently emerged as critical regulators of sleep-wake cycles, reward-seeking, and body energy balance. However, at the level of cellular and network properties, it remains unclear whether hcrt/orx neurons are one homogenous population, or whether there are several distinct types of hcrt/orx cells. Here, we collated diverse structural and functional information about individual hcrt/orx neurons in mouse brain slices, by combining patch-clamp analysis of spike firing, membrane currents, and synaptic inputs with confocal imaging of cell shape and subsequent 3-dimensional Sholl analysis of dendritic architecture. Statistical cluster analysis of intrinsic firing properties revealed that hcrt/orx neurons fall into two distinct types. These two cell types also differ in the complexity of their dendritic arbour, the strength of AMPA and GABAA receptor-mediated synaptic drive that they receive, and the density of low-threshold, 4-aminopyridine-sensitive, transient K+ current. Our results provide quantitative evidence that, at the cellular level, the mouse hcrt/orx system is composed of two classes of neurons with different firing properties, morphologies, and synaptic input organization.

Daniel Glass, Leopold Parts, David A. Knowles, Abraham Aviv, and Tim D. Spector. No correlation between childhood maltreatment and telomere length. Biological Psychiatry, 68(6):21-22, 2010.

Abstract: Telomeres are lengths of repetitive DNA that cap the ends of chromosomes. They protect the ends of the chromosome and shorten with each cell division. Short leukocyte telomere length has been related to a number of age-related diseases. In addition, shorter telomere length has been associated with environmental factors such as smoking and lack of exercise. In a recent issue of Biological Psychiatry, Tyrka et al. (4) published a report suggesting a link between maltreatment in childhood and telomere shortening in 31 subjects. Individuals who had suffered maltreatment had telomere length .70 +/- .24 compared with 1.02 +/- .52 in individuals who had not been abused.

David A. Knowles, Leopold Parts, Daniel Glass, and John M. Winn. Modeling skin and ageing phenotypes using latent variable models in Infer.NET. In NIPS Workshop: Predictive Models in Personalized Medicine Workshop, 2010.

Abstract: We demonstrate and compare three unsupervised Bayesian latent variable models implemented in Infer.NET for biomedical data modeling of 42 skin and ageing phenotypes measured on the 12,000 female twins in the Twins UK study. We address various data modeling problems including high missingness, heterogeneous data, and repeat observations. We compare the proposed models in terms of their performance at predicting disease labels and symptoms from available explanatory variables, concluding that factor analysis type models have the strongest statistical performance in this setting. We show that such models can be combined with regression components for improved interpretability.

Comment: web site

Finale Doshi-Velez, David Knowles, Shakir Mohamed, and Zoubin Ghahramani. Large scale non-parametric inference: Data parallelisation in the Indian buffet process. In Advances in Neural Information Processing Systems 22, pages 1294-1302, Cambridge, MA, USA, December 2009. The MIT Press.

Abstract: Nonparametric Bayesian models provide a framework for flexible probabilistic modelling of complex datasets. Unfortunately, the high-dimensional averages required for Bayesian methods can be slow, especially with the unbounded representations used by nonparametric models. We address the challenge of scaling Bayesian inference to the increasingly large datasets found in real-world applications. We focus on parallelisation of inference in the Indian Buffet Process (IBP), which allows data points to have an unbounded number of sparse latent features. Our novel MCMC sampler divides a large data set between multiple processors and uses message passing to compute the global likelihoods and posteriors. This algorithm, the first parallel inference scheme for IBP-based models, scales to datasets orders of magnitude larger than have previously been possible.

David A. Knowles and Susan Holmes. Statistical tools for ultra-deep pyrosequencing of fast evolving viruses. In NIPS Workshop: Computational Biology, 2009.

Abstract: We aim to detect minor variant Hepatitis B viruses (HBV) in 38 pyrosequencing samples from infected individuals. Errors involved in the amplification and ultra deep pyrosequencing (UDPS) of these samples are characterised using HBV plasmid controls. Homopolymeric regions and quality scores are found to be significant covariates in determining insertion and deletion (indel) error rates, but not mismatch rates which depend on the nucleotide transition matrix. This knowledge is used to derive two methods for classifying genuine mutations: a hypothesis testing framework and a mixture model. Using an approximate "ground truth" from a limiting dilution Sanger sequencing run, these methods are shown to outperform the naive percentage threshold approach. The possibility of early stage PCR errors becoming significant is investigated by simulation, which underlines the importance of the initial copy number.

Comment: web site

David Knowles and Zoubin Ghahramani. Infinite sparse factor analysis and infinite independent components analysis. In 7th International Conference on Independent Component Analysis and Signal Separation, pages 381-388, London, UK, September 2007. Springer, doi 10.1007/978-3-540-74494-8_48.

Abstract: A nonparametric Bayesian extension of Independent Components Analysis (ICA) is proposed where observed data Y is modelled as a linear superposition, G, of a potentially infinite number of hidden sources, X. Whether a given source is active for a specific data point is specified by an infinite binary matrix, Z. The resulting sparse representation allows increased data reduction compared to standard ICA. We define a prior on Z using the Indian Buffet Process (IBP). We describe four variants of the model, with Gaussian or Laplacian priors on X and the one- or two-parameter IBP. We demonstrate Bayesian inference under these models using a Markov chain Monte Carlo (MCMC) algorithm on synthetic and gene expression data and compare to standard ICA algorithms.
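
Illustration: the binary matrix Z described above can be drawn from the standard one-parameter IBP with its usual sequential "restaurant" construction. The sketch below is not taken from the paper. The function names (sample_poisson, sample_ibp) and the concentration parameter alpha are assumptions made for this example, and it covers only the prior on Z, not the model's inference procedure.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw from Poisson(rate) using Knuth's multiplication method."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_ibp(num_points, alpha, rng):
    """Sample a binary matrix Z (rows = data points) from the one-parameter IBP.

    alpha is an assumed concentration parameter: larger values give more
    active features per data point.
    """
    feature_counts = []  # m_k: number of earlier points using feature k
    rows = []
    for i in range(1, num_points + 1):
        # Point i reuses existing feature k with probability m_k / i.
        row = [1 if rng.random() < m / i else 0 for m in feature_counts]
        for k, z in enumerate(row):
            feature_counts[k] += z
        # Point i also introduces Poisson(alpha / i) brand-new features.
        new = sample_poisson(alpha / i, rng)
        row.extend([1] * new)
        feature_counts.extend([1] * new)
        rows.append(row)
    # Pad earlier rows with zeros so Z is rectangular.
    width = len(feature_counts)
    return [row + [0] * (width - len(row)) for row in rows]


if __name__ == "__main__":
    for row in sample_ibp(num_points=6, alpha=2.0, rng=random.Random(0)):
        print("".join(str(z) for z in row))
```

The number of columns is not fixed in advance: it grows as new points introduce features, which is what lets the number of hidden sources be inferred from data.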

Reports

Contact details

E-mail: dak33 at cam dot ac dot uk
Mail: Cambridge University Engineering Department, Trumpington Street, Cambridge CB2 1PZ, UK

© University of Cambridge, Department of Engineering
Information provided by David Knowles (dak33)