Bioinformatics

Recent advances in biology have allowed us to collect vast amounts of genetic, proteomic and biomedical data. While this data offers the potential to help us understand the building blocks of life, and to revolutionise medicine, analysing and understanding it poses immense computational and statistical challenges. Our work in Bionformatics includes modelling protein secondary and tertiary structure, analysis of gene microarray data, protein-protein interactions, and biomarker discovery.

A Bayesian approach to reconstructing genetic regulatory networks with hidden factors

Matthew J. Beal, Francesco Falciani, Zoubin Ghahramani, Claudia Rangel, David L. Wild, 2005. (Bioinformatics).

Abstract▼ URL

Motivation: We have used state-space models (SSMs) to reverse engineer transcriptional networks from highly replicated gene expression profiling time series data obtained from a well-established model of T cell activation. SSMs are a class of dynamic Bayesian networks in which the observed measurements depend on some hidden state variables that evolve according to Markovian dynamics. These hidden variables can capture effects that cannot be directly measured in a gene expression profiling experiment, for example: genes that have not been included in the microarray, levels of regulatory proteins, the effects of mRNA and protein degradation, etc. Results: We have approached the problem of inferring the model structure of these state-space models using both classical and Bayesian methods. In our previous work, a bootstrap procedure was used to derive classical confidence intervals for parameters representing `gene–gene’ interactions over time. In this article, variational approximations are used to perform the analogous model selection task in the Bayesian context. Certain interactions are present in both the classical and the Bayesian analyses of these regulatory networks. The resulting models place JunB and JunD at the centre of the mechanisms that control apoptosis and proliferation. These mechanisms are key for clonal expansion and for controlling the long term behavior (e.g. programmed cell death) of these cells.

Equilibrium simulations of proteins using molecular fragment replacement and NMR chemical shifts

Wouter Boomsma, Pengfei Tian, Jes Frellsen, Jesper Ferkinghoff-Borg, Thomas Hamelryck, Kresten Lindorff-Larsen, Michele Vendruscolo, 2014. (Proceedings of the National Academy of Sciences). DOI: 10.1073/pnas.1404948111.

Abstract▼

Methods of protein structure determination based on NMR chemical shifts are becoming increasingly common. The most widely used approaches adopt the molecular fragment replacement strategy, in which structural fragments are repeatedly reassembled into different complete conformations in molecular simulations. Although these approaches are effective in generating individual structures consistent with the chemical shift data, they do not enable the sampling of the conformational space of proteins with correct statistical weights. Here, we present a method of molecular fragment replacement that makes it possible to perform equilibrium simulations of proteins, and hence to determine their free energy landscapes. This strategy is based on the encoding of the chemical shift information in a probabilistic model in Markov chain Monte Carlo simulations. First, we demonstrate that with this approach it is possible to fold proteins to their native states starting from extended structures. Second, we show that the method satisfies the detailed balance condition and hence it can be used to carry out an equilibrium sampling from the Boltzmann distribution corresponding to the force field used in the simulations. Third, by comparing the results of simulations carried out with and without chemical shift restraints we describe quantitatively the effects that these restraints have on the free energy landscapes of proteins. Taken together, these results demonstrate that the molecular fragment replacement strategy can be used in combination with chemical shift information to characterize not only the native structures of proteins but also their conformational fluctuations.

The Status of Structural Genomics Defined Through the Analysis of Current Targets and Structures

Philip E. Bourne, C. K. J. Allerston, Werner G. Krebs, Wilfred W. Li, Ilya N. Shindyalov, Adam Godzik, Iddo Friedberg, Tong Liu, David L. Wild, Seungwoo Hwang, Zoubin Ghahramani, Li Chen, John D. Westbrook, 2004. (In Pacific Symposium on Biocomputing). Edited by Russ B. Altman, A. Keith Dunker, Lawrence Hunter, Tiffany A. Jung, Teri E. Klein. World Scientific. ISBN: 981-238-598-3.

Abstract▼ URL

Structural genomics–large-scale macromolecular 3-dimenional structure determination–is unique in that major participants report scientific progress on a weekly basis. The target database (TargetDB) maintained by the Protein Data Bank (http://targetdb.pdb.org) reports this progress through the status of each protein sequence (target) under consideration by the major structural genomics centers worldwide. Hence, TargetDB provides a unique opportunity to analyze the potential impact that this major initiative provides to scientists interested in the sequence-structure-function-disease paradigm. Here we report such an analysis with a focus on: (i) temporal characteristics–how is the project doing and what can we expect in the future? (ii) target characteristics–what are the predicted functions of the proteins targeted by structural genomics and how biased is the target set when compared to the PDB and to predictions across complete genomes? (iii) structures solved–what are the characteristics of structures solved thus far and what do they contribute? The analysis required a more extensive database of structure predictions using different methods integrated with data from other sources. This database, associated tools and related data sources are available from http://spam.sdsc.edu.

Meta-learning Adaptive Deep Kernel Gaussian Processes for Molecular Property Prediction

Wenlin Chen, Austin Tripp, José Miguel Hernández-Lobato, 2022. (arXiv).

Abstract▼ URL

We propose Adaptive Deep Kernel Fitting with Implicit Function Theorem (ADKF-IFT), a novel framework for learning deep kernel Gaussian processes (GPs) by interpolating between meta-learning and conventional deep kernel learning. Our approach employs a bilevel optimization objective where we meta-learn generally useful feature representations across tasks, in the sense that task-specific GP models estimated on top of such features achieve the lowest possible predictive loss on average. We solve the resulting nested optimization problem using the implicit function theorem (IFT). We show that our ADKF-IFT framework contains previously proposed Deep Kernel Learning (DKL) and Deep Kernel Transfer (DKT) as special cases. Although ADKF-IFT is a completely general method, we argue that it is especially well-suited for drug discovery problems and demonstrate that it significantly outperforms previous state-of-the-art methods on a variety of real-world few-shot molecular property prediction tasks and out-of-domain molecular property prediction and optimization tasks.

Biomarker discovery in microarray gene expression data with Gaussian processes

Wei Chu, Zoubin Ghahramani, Francesco Falciani, David L. Wild, 2005. (Bioinformatics).

Abstract▼ URL

MOTIVATION: In clinical practice, pathological phenotypes are often labelled with ordinal scales rather than binary, e.g. the Gleason grading system for tumour cell differentiation. However, in the literature of microarray analysis, these ordinal labels have been rarely treated in a principled way. This paper describes a gene selection algorithm based on Gaussian processes to discover consistent gene expression patterns associated with ordinal clinical phenotypes. The technique of automatic relevance determination is applied to represent the significance level of the genes in a Bayesian inference framework. RESULTS: The usefulness of the proposed algorithm for ordinal labels is demonstrated by the gene expression signature associated with the Gleason score for prostate cancer data. Our results demonstrate how multi-gene markers that may be initially developed with a diagnostic or prognostic application in mind are also useful as an investigative tool to reveal associations between specific molecular and cellular events and features of tumour physiology. Our algorithm can also be applied to microarray data with binary labels with results comparable to other methods in the literature.

Identifying Protein Complexes in High-Throughput Protein Interaction Screens Using an Infinite Latent Feature Model

Wei Chu, Zoubin Ghahramani, Roland Krause, David L. Wild, 2006. (In Pacific Symposium on Biocomputing). Edited by Russ B. Altman, Tiffany Murray, Teri E. Klein, A. Keith Dunker, Lawrence Hunter. World Scientific. ISBN: 981-256-463-2.

Abstract▼ URL

We propose a Bayesian approach to identify protein complexes and their constituents from high-throughput protein-protein interaction screens. An infinite latent feature model that allows for multi-complex membership by individual proteins is coupled with a graph diffusion kernel that evaluates the likelihood of two proteins belonging to the same complex. Gibbs sampling is then used to infer a catalog of protein complexes from the interaction screen data. An advantage of this model is that it places no prior constraints on the number of complexes and automatically infers the number of significant complexes from the data. Validation results using affinity purification/mass spectrometry experimental data from yeast RNA-processing complexes indicate that our method is capable of partitioning the data in a biologically meaningful way. A supplementary web site containing larger versions of the figures is available at http://public.kgi.edu/wild/PSBO6/index.html.

Bayesian Segmental Models with Multiple Sequence Alignment Profiles for Protein Secondary Structure and Contact Map Prediction

Wei Chu, Zoubin Ghahramani, Alexei A. Podtelezhnikov, David L. Wild, 2006. (IEEE/ACM Trans. Comput. Biology Bioinform.).

Abstract▼ URL

In this paper, we develop a segmental semi-Markov model (SSMM) for protein secondary structure prediction which incorporates multiple sequence alignment profiles with the purpose of improving the predictive performance. The segmental model is a generalization of the hidden Markov model where a hidden state generates segments of various length and secondary structure type. A novel parameterized model is proposed for the likelihood function that explicitly represents multiple sequence alignment profiles to capture the segmental conformation. Numerical results on benchmark data sets show that incorporating the profiles results in substantial improvements and the generalization performance is promising. By incorporating the information from long range interactions in beta-sheets, this model is also capable of carrying out inference on contact maps. This is an important advantage of probabilistic generative models over the traditional discriminative approach to protein secondary structure prediction. The Web server of our algorithm and supplementary materials are available at http://public.kgi.edu/-wild/bsm.html.

A graphical model for protein secondary structure prediction

Wei Chu, Zoubin Ghahramani, David L. Wild, 2004. (In ICML). Edited by Carla E. Brodley. acm. ACM International Conference Proceeding Series.

Abstract▼ URL

In this paper, we present a graphical model for protein secondary structure prediction. This model extends segmental semi-Markov models (SSMM) to exploit multiple sequence alignment profiles which contain information from evolutionarily related sequences. A novel parameterized model is proposed as the likelihood function for the SSMM to capture the segmental conformation. By incorporating the information from long range interactions in β-sheets, this model is capable of carrying out inference on contact maps. The numerical results on benchmark data sets show that incorporating the profiles results in substantial improvements and the generalization performance is promising.

Protein secondary structure prediction using sigmoid belief networks to parameterize segmental semi-Markov models

Wei Chu, Zoubin Ghahramani, David L. Wild, 2004. (In ESANN).

Abstract▼ URL

In this paper, we merge the parametric structure of neural networks into a segmental semi-Markov model to set up a Bayesian framework for protein structure prediction. The parametric model, which can also be regarded as an extension of a sigmoid belief network, captures the underlying dependency in residue sequences. The results of numerical experiments indicate the usefulness of this approach.

Clustering Protein Sequence and Structure Space with Infinite Gaussian Mixture Models

Ananya Dubey, Seungwoo Hwang, Claudia Rangel, Carl Edward Rasmussen, Zoubin Ghahramani, David L. Wild, 2004. (In Pacific Symposium on Biocomputing). Edited by Russ B. Altman, A. Keith Dunker, Lawrence Hunter, Tiffany A. Jung, Teri E. Klein. World Scientific. ISBN: 981-238-598-3.

Abstract▼ URL

We describe a novel approach to the problem of automatically clustering protein sequences and discovering protein families, subfamilies etc., based on the theory of infinite Gaussian mixtures models. This method allows the data itself to dictate how many mixture components are required to model it, and provides a measure of the probability that two proteins belong to the same cluster. We illustrate our methods with application to three data sets: globin sequences, globin sequences with known three-dimensional structures and G-protein coupled receptor sequences. The consistency of the clusters indicate that our method is producing biologically meaningful results, which provide a very good indication of the underlying families and subfamilies. With the inclusion of secondary structure and residue solvent accessibility information, we obtain a classification of sequences of known structure which both reflects and extends their SCOP classifications. A supplementray web site containing larger versions of the figures is available at http://public.kgi.edu/approximately wid/PSB04/index.html

Clustering Protein Sequence and Structure Space with Infinite Gaussian Mixture Models

A. Dubey, S. Hwang, C. Rangel, Carl Edward Rasmussen, Zoubin Ghahramani, David L. Wild, 2004. (In Pacific Symposium on Biocomputing 2004). (Pacific Symposium on Biocomputing 2004; Vol. 9). Singapore. The Big Island of Hawaii. World Scientific Publishing.

Abstract▼ URL

We describe a novel approach to the problem of automatically clustering protein sequences and discovering protein families, subfamilies etc., based on the thoery of infinite Gaussian mixture models. This method allows the data itself to dictate how many mixture components are required to model it, and provides a measure of the probability that two proteins belong to the same cluster. We illustrate our methods with application to three data sets: globin sequences, globin sequences with known tree-dimensional structures and G-pretein coupled receptor sequences. The consistency of the clusters indicate that that our methods is producing biologically meaningful results, which provide a very good indication of the underlying families and subfamilies. With the inclusion of secondary structure and residue solvent accessibility information, we obtain a classification of sequences of known structure which reflects and extends their SCOP classifications.

Combining the multicanonical ensemble with generative probabilistic models of local biomolecular structure

Jes Frellsen, Thomas Hamelryck, Jesper Ferkinghoff-Borg, 2014. (In Proceedings of the 59th World Statistics Congress of the International Statistical Institute). Hong Kong.

Abstract▼ URL

Markov chain Monte Carlo is a powerful tool for sampling complex systems such as large biomolecular structures. However, the standard Metropolis-Hastings algorithm suffers from a number of deficiencies when applied to systems with rugged free-energy landscapes. Some of these deficiencies can be addressed with the multicanonical ensemble. In this paper we will present two strategies for applying the multicanonical ensemble to distributions constructed from generative probabilistic models of local biomolecular structure. In particular, we will describe how to use the multicanonical ensemble efficiently in conjunction with the reference ratio method.

Latent Gaussian Processes for Distribution Estimation of Multivariate Categorical Data

Yarin Gal, Yutian Chen, Zoubin Ghahramani, 2015. (In Proceedings of the 32nd International Conference on Machine Learning (ICML-15)).

Abstract▼ URL

Multivariate categorical data occur in many applications of machine learning. One of the main difficulties with these vectors of categorical variables is sparsity. The number of possible observations grows exponentially with vector length, but dataset diversity might be poor in comparison. Recent models have gained significant improvement in supervised tasks with this data. These models embed observations in a continuous space to capture similarities between them. Building on these ideas we propose a Bayesian model for the unsupervised task of distribution estimation of multivariate categorical data. We model vectors of categorical variables as generated from a non-linear transformation of a continuous latent space. Non-linearity captures multi-modality in the distribution. The continuous representation addresses sparsity. Our model ties together many existing models, linking the linear categorical latent Gaussian model, the Gaussian process latent variable model, and Gaussian process classification. We derive inference for our model based on recent developments in sampling based variational inference. We show empirically that the model outperforms its linear and discrete counterparts in imputation tasks of sparse data.

No Correlation Between Childhood Maltreatment and Telomere Length.

Daniel Glass, Leopold Parts, David A. Knowles, Abraham Aviv, Tim D. Spector, 2010. (Biological Psychiatry).

Abstract▼

Telomeres are lengths of repetitive DNA that cap the ends of chromosomes. They protect the ends of the chromosome and shorten with each cell division. Short leukocyte telomere length has been related to a number of age-related diseases. In addition, shorter telomere length has been associated with environmental factors such as smoking and lack of exercise. In a recent issue of Biological Psychiatry, Tyrka et al. (4) published a report suggesting a link between maltreatment in childhood and telomere shortening in 31 subjects. Individuals who had suffered maltreatment had telomere length .70 +/- .24 compared with 1.02 +/- .52 in individuals who had not been abused.

Adaptable probabilistic mapping of short reads using position specific scoring matrices

Peter Kerpedjiev, Jes Frellsen, Stinus Lindgreen, Anders Krogh, 2014. (BMC bioinformatics). DOI: 10.1186/1471-2105-15-100.

Abstract▼

BACKGROUND: Modern DNA sequencing methods produce vast amounts of data that often requires mapping to a reference genome. Most existing programs use the number of mismatches between the read and the genome as a measure of quality. This approach is without a statistical foundation and can for some data types result in many wrongly mapped reads. Here we present a probabilistic mapping method based on position-specific scoring matrices, which can take into account not only the quality scores of the reads but also user-specified models of evolution and data-specific biases.RESULTS:We show how evolution, data-specific biases, and sequencing errors are naturally dealt with probabilistically. Our method achieves better results than Bowtie and BWA on simulated and real ancient and PAR-CLIP reads, as well as on simulated reads from the AT rich organism P. falciparum, when modeling the biases of these data. For simulated Illumina reads, the method has consistently higher sensitivity for both single-end and paired-end data. We also show that our probabilistic approach can limit the problem of random matches from short reads of contamination and that it improves the mapping of real reads from one organism (D. melanogaster) to a related genome (D. simulans). CONCLUSION: The presented work is an implementation of a novel approach to short read mapping where quality scores, prior mismatch probabilities and mapping qualities are handled in a statistically sound manner. The resulting implementation provides not only a tool for biologists working with low quality and/or biased sequencing data but also a demonstration of the feasibility of using a probability based alignment method on real and simulated data sets.

Comment: Peter Kerpedjiev and Jes Frellsen contributed equally. Additional resources are available at bwa-pssm.binf.ku.dk

Bayesian Correlated clustering to integrate multiple datasets

Paul D. W. Kirk, Jim E. Griffin, Richard S. Savage, Zoubin Ghahramani, David L. Wild, 2012. (Bioinformatics).

Abstract▼ URL

MOTIVATION: The integration of multiple datasets remains a key challenge in systems biology and genomic medicine. Modern high-throughput technologies generate a broad array of different data types, providing distinct-but often complementary-information. We present a Bayesian method for the unsupervised integrative modelling of multiple datasets, which we refer to as MDI (Multiple Dataset Integration). MDI can integrate information from a wide range of different datasets and data types simultaneously (including the ability to model time series data explicitly using Gaussian processes). Each dataset is modelled using a Dirichlet-multinomial allocation (DMA) mixture model, with dependencies between these models captured through parameters that describe the agreement among the datasets. RESULTS: Using a set of six artificially constructed time series datasets, we show that MDI is able to integrate a significant number of datasets simultaneously, and that it successfully captures the underlying structural similarity between the datasets. We also analyse a variety of real Saccharomyces cerevisiae datasets. In the two-dataset case, we show that MDI’s performance is comparable with the present state-of-the-art. We then move beyond the capabilities of current approaches and integrate gene expression, chromatin immunoprecipitation-chip and protein-protein interaction data, to identify a set of protein complexes for which genes are co-regulated during the cell cycle. Comparisons to other unsupervised data integration techniques-as well as to non-integrative approaches-demonstrate that MDI is competitive, while also providing information that would be difficult or impossible to extract using other methods.

Bayesian correlated clustering to integrate multiple datasets

P. Kirk, J. E. Griffin, R. S. Savage, Z. Ghahramani, D. L. Wild, 2012. (Bioinformatics).

Abstract▼ URL

Motivation: The integration of multiple datasets remains a key challenge in systems biology and genomic medicine. Modern high-throughput technologies generate a broad array of different data types, providing distinct – but often complementary – information. We present a Bayesian method for the unsupervised integrative modelling of multiple datasets, which we refer to as MDI (Multiple Dataset Integration). MDI can integrate information from a wide range of different datasets and data types simultaneously (including the ability to model time series data explicitly using Gaussian processes). Each dataset is modelled using a Dirichlet-multinomial allocation (DMA) mixture model, with dependencies between these models captured via parameters that describe the agreement among the datasets. Results: Using a set of 6 artificially constructed time series datasets, we show that MDI is able to integrate a significant number of datasets simultaneously, and that it successfully captures the underlying structural similarity between the datasets. We also analyse a variety of real S. cerevisiae datasets. In the 2-dataset case, we show that MDI’s performance is comparable to the present state of the art. We then move beyond the capabilities of current approaches and integrate gene expression, ChIP-chip and protein-protein interaction data, to identify a set of protein complexes for which genes are co-regulated during the cell cycle. Comparisons to other unsupervised data integration techniques – as well as to non-integrative approaches – demonstrate that MDI is very competitive, while also providing information that would be difficult or impossible to extract using other methods.

Comment: This paper is available from the Bioinformatics site and a Matlab implementation of MDI is available fromthis site.

Infinite Sparse Factor Analysis and Infinite Independent Components Analysis

David Knowles, Zoubin Ghahramani, September 2007. (In 7th International Conference on Independent Component Analysis and Signal Separation). London, UK. Springer. DOI: 10.1007/978-3-540-74494-8_48.

Abstract▼ URL

A nonparametric Bayesian extension of Independent Components Analysis (ICA) is proposed where observed data Y is modelled as a linear superposition, G, of a potentially infinite number of hidden sources, X. Whether a given source is active for a specific data point is specified by an infinite binary matrix, Z. The resulting sparse representation allows increased data reduction compared to standard ICA. We define a prior on Z using the Indian Buffet Process (IBP). We describe four variants of the model, with Gaussian or Laplacian priors on X and the one or two-parameter IBPs. We demonstrate Bayesian inference under these models using a Markov chain Monte Carlo (MCMC) algorithm on synthetic and gene expression data and compare to standard ICA algorithms.

Nonparametric Bayesian Sparse Factor Models with application to Gene Expression modelling.

David A. Knowles, Zoubin Ghahramani, 2011. (Annals of Applied Statistics).

Abstract▼ URL

A nonparametric Bayesian extension of Factor Analysis (FA) is proposed where observed data Y is modeled as a linear superposition, G, of a potentially infinite number of hidden factors, X. The Indian Buffet Process (IBP) is used as a prior on G to incorporate sparsity and to allow the number of latent features to be inferred. The model’s utility for modeling gene expression data is investigated using randomly generated data sets based on a known sparse connectivity matrix for E. Coli, and on three biological data sets of increasing complexity.

Statistical tools for ultra-deep pyrosequencing of fast evolving viruses

David A. Knowles, Susan Holmes, 2009. (In NIPS Workshop: Computational Biology).

Abstract▼ URL

We aim to detect minor variant Hepatitis B viruses (HBV) in 38 pyrosequencing samples from infected individuals. Errors involved in the amplification and ultra deep pyrosequencing (UDPS) of these samples are characterised using HBV plasmid controls. Homopolymeric regions and quality scores are found to be significant covariates in determining insertion and deletion (indel) error rates, but not mismatch rates which depend on the nucleotide transition matrix. This knowledge is used to derive two methods for classifying genuine mutations: a hypothesis testing framework and a mixture model. Using an approximate “ground truth” from a limiting dilution Sanger sequencing run, these methods are shown to outperform the naive percentage threshold approach. The possibility of early stage PCR errors becoming significant is investigated by simulation, which underlines the importance of the initial copy number.

Comment: web site

Modeling skin and ageing phenotypes using latent variable models in Infer.NET

David A. Knowles, Leopold Parts, Daniel Glass, John M. Winn, 2010. (In NIPS Workshop: Predictive Models in Personalized Medicine Workshop).

Abstract▼ URL

We demonstrate and compare three unsupervised Bayesian latent variable models implemented in Infer.NET for biomedical data modeling of 42 skin and ageing phenotypes measured on the 12,000 female twins in the Twins UK study. We address various data modeling problems include high missingness, heterogeneous data, and repeat observations. We compare the proposed models in terms of their performance at predicting disease labels and symptoms from available explanatory variables, concluding that factor analysis type models have the strongest statistical performance in this setting. We show that such models can be combined with regression components for improved interpretability.

Comment: web site

Gene function prediction from synthetic lethality networks via ranking on demand

C. Lippert, Z. Ghahramani, K. Borgwardt, 2010. (Bioinformatics).

Abstract▼ URL

Motivation: Synthetic lethal interactions represent pairs of genes whose individual mutations are not lethal, while the double mutation of both genes does incur lethality. Several studies have shown a correlation between functional similarity of genes and their distances in networks based on synthetic lethal interactions. However, there is a lack of algorithms for predicting gene function from synthetic lethality interaction networks. Results: In this article, we present a novel technique called kernelROD for gene function prediction from synthetic lethal interaction networks based on kernel machines. We apply our novel algorithm to Gene Ontology functional annotation prediction in yeast. Our experiments show that our method leads to improved gene function prediction compared with state-of-the-art competitors and that combining genetic and congruence networks leads to a further improvement in prediction accuracy.

A kernel method for unsupervised structured network inference

C. Lippert, O. Stegle, Z. Ghahramani, K. Borgwardt, April 2009. (In 12th International Conference on Artificial Intelligence and Statistics). Edited by D. van Dyk, M. Welling. Clearwater Beach, FL, USA. Journal of Machine Learning Research. Note: ISSN: 1938-7228.

Abstract▼ URL

Network inference is the problem of inferring edges between a set of real-world objects, for instance, interactions between pairs of proteins in bioinformatics. Current kernel-based approaches to this problem share a set of common features: (i) they are supervised and hence require labeled training data; (ii) edges in the network are treated as mutually independent and hence topological properties are largely ignored; (iii) they lack a statistical interpretation. We argue that these common assumptions are often undesirable for network inference, and propose (i) an unsupervised kernel method (ii) that takes the global structure of the network into account and (iii) is statistically motivated. We show that our approach can explain commonly used heuristics in statistical terms. In experiments on social networks, dfferent variants of our method demonstrate appealing predictive performance.

On the Accuracy of Short Read Mapping

Peter Menzel, Jes Frellsen, Mireya Plass, Simon H. Rasmussen, Anders Krogh, 2013. (In Deep Sequencing Data Analysis). Springer. DOI: 10.1007/978-1-62703-514-9_3.

Abstract▼

The development of high-throughput sequencing technologies has revolutionized the way we study genomes and gene regulation. In a single experiment, millions of reads are produced. To gain knowledge from these experiments the first thing to be done is finding the genomic origin of the reads, i.e., mapping the reads to a reference genome. In this new situation, conventional alignment tools are obsolete, as they cannot handle this huge amount of data in a reasonable amount of time. Thus, new mapping algorithms have been developed, which are fast at the expense of a small decrease in accuracy. In this chapter we discuss the current problems in short read mapping and show that mapping reads correctly is a nontrivial task. Through simple experiments with both real and synthetic data, we demonstrate that different mappers can give different results depending on the type of data, and that a considerable fraction of uniquely mapped reads is potentially mapped to an incorrect location. Furthermore, we provide simple statistical results on the expected number of random matches in a genome (E-value) and the probability of a random match as a function of read length. Finally, we show that quality scores contain valuable information for mapping and why mapping quality should be evaluated in a probabilistic manner. In the end, we discuss the potential of improving the performance of current methods by considering these quality scores in a probabilistic mapping program.

Comment: Peter Menzel and Jes Frellsen contributed equally.

Distinct epigenomic features in human cardiomyopathy

Mehregan Movassagh, Mun-Kit Choy, David A. Knowles, Lina Cordeddu, Syed Haider, Thomas Down, Lee Siggens, Ana Vujic, Ilenia Simeoni, Chris Penkett, Martin Goddard, Pietro Lio, Martin Bennett, Roger Foo, 2011. (Circulation, American Heart Association).

Abstract▼ URL

Background. The epigenome refers to marks on the genome including DNA methylation and histone modifications that regulate the expression of underlying genes. A consistent profile of gene expression changes in end- stage cardiomyopathy led us to hypothesise that distinct global patterns of the epigenome may also exist. Methods and Results. We constructed genome-wide maps of DNA methylation and Histone-3 Lysine-36 tri-methylation (H3K36me3)-enrichment for cardiomyopathic and normal human hearts. 506Mb of sequence per library was generated by high-throughput sequencing, covering 24 million out of the 28 million CG di-nucleotides in the human genome. DNA methylation was significantly different in promoter CpG-islands (CGI), intra-genic CGI, gene bodies and H3K36me3-enriched regions of the genome. Moreover DNA methylation differences were present in promoters of upregulated genes but not down-regulated genes. The profile of H3K36me3-enrichment itself was also significantly different in protein-coding regions of the genome. Conclusions. Distinct epigenomic patterns exist in important DNA elements of the human cardiac genome in end-stage cardiomyopathy. If epigenomic patterns track with disease progression, assays for the epigenome may be more useful than quantification of mRNA for assessing prognosis in heart failure. These results open up an important new horizon of research and further studies will be needed to determine how epigenomics contribute to altered gene expression in cardiomyopathy.

Modeling T-cell activation using gene expression profiling and state-space models

Claudia Rangel, John Angus, Zoubin Ghahramani, Maria Lioumi, Elizabeth Sotheran, Alessia Gaiba, David L. Wild, Francesco Falciani, 2004. (Bioinformatics).

Abstract▼ URL

Motivation: We have used state-space models to reverse engineer transcriptional networks from highly replicated gene expression profiling time series data obtained from a well-established model of T-cell activation. State space models are a class of dynamic Bayesian networks that assume that the observed measurements depend on some hidden state variables that evolve according to Markovian dynamics. These hidden variables can capture effects that cannot be measured in a gene expression profiling experiment, e.g. genes that have not been included in the microarray, levels of regulatory proteins, the effects of messenger RNA and protein degradation, etc. Results: Bootstrap confidence intervals are developed for parameters representing `gene–gene’ interactions over time. Our models represent the dynamics of T-cell activation and provide a methodology for the development of rational and experimentally testable hypotheses. Availability: Supplementary data and Matlab computer source code will be made available on the web at the URL given below. Supplementary information: .

Modeling and Visualizing Uncertainty in Gene Expression Clusters Using Dirichlet Process Mixtures

Carl Edward Rasmussen, Bernhard J. de la Cruz, Zoubin Ghahramani, David L. Wild, 2009. (IEEE/ACM Transactions on Computational Biology and Bioinformatics). DOI: 10.1109/TCBB.2007.70269. ISSN: 1545-5963.

Abstract▼ URL

Although the use of clustering methods has rapidly become one of the standard computational approaches in the literature of microarray gene expression data, little attention has been paid to uncertainty in the results obtained. Dirichlet process mixture (DPM) models provide a nonparametric Bayesian alternative to the bootstrap approach to modeling uncertainty in gene expression clustering. Most previously published applications of Bayesian model-based clustering methods have been to short time series data. In this paper, we present a case study of the application of nonparametric Bayesian clustering methods to the clustering of high-dimensional nontime series gene expression data using full Gaussian covariances. We use the probability that two genes belong to the same cluster in a DPM model as a measure of the similarity of these gene expression profiles. Conversely, this probability can be used to define a dissimilarity measure, which, for the purposes of visualization, can be input to one of the standard linkage algorithms used for hierarchical clustering. Biologically plausible results are obtained from the Rosetta compendium of expression profiles which extend previously published cluster analyses of this data.

A Bayesian network model for protein fold and remote homologue recognition

A. Raval, Zoubin Ghahramani, David L. Wild, 2002. (Bioinformatics).

Abstract▼ URL

Motivation: The Bayesian network approach is a framework which combines graphical representation and probability theory, which includes, as a special case, hidden Markov models. Hidden Markov models trained on amino acid sequence or secondary structure data alone have been shown to have potential for addressing the problem of protein fold and superfamily classification. Results: This paper describes a novel implementation of a Bayesian network which simultaneously learns amino acid sequence, secondary structure and residue accessibility for proteins of known three-dimensional structure. An awareness of the errors inherent in predicted secondary structure may be incorporated into the model by means of a confusion matrix. Training and validation data have been derived for a number of protein superfamilies from the Structural Classification of Proteins (SCOP) database. Cross validation results using posterior probability classification demonstrate that the Bayesian network performs better in classifying proteins of known structural superfamily than a hidden Markov model trained on amino acid sequences alone.

Discovering Transcriptional Modules by Bayesian Data Integration

R. S. Savage, Z. Ghahramani, J. E. Griffin, B. de la Cruz, D. L. Wild, 2010. (Bioinformatics).

Abstract▼ URL

Motivation: We present a method for directly inferring transcriptional modules (TMs) by integrating gene expression and transcription factor binding (ChIP-chip) data. Our model extends a hierarchical Dirichlet process mixture model to allow data fusion on a gene-by-gene basis. This encodes the intuition that co-expression and co-regulation are not necessarily equivalent and hence we do not expect all genes to group similarly in both datasets. In particular, it allows us to identify the subset of genes that share the same structure of transcriptional modules in both datasets.Results: We find that by working on a gene-by-gene basis, our model is able to extract clusters with greater functional coherence than existing methods. By combining gene expression and transcription factor binding (ChIP-chip) data in this way, we are better able to determine the groups of genes that are most likely to represent underlying TMs.Availability: If interested in the code for the work presented in this article, please contact the authors.

R/BHC: fast Bayesian hierarchical clustering for microarray data

R. Savage, K. A. Heller, Y. Xu, Zoubin Ghahramani, W. Truman, M. Grant, K. Denby, D. L. Wild, August 2009. (BMC Bioinformatics 2009). BioMed Central. DOI: 10.1186/1471-2105-10-242. ISSN: 1471-2105. PubMed ID: 19660130.

Abstract▼ URL

Background: Although the use of clustering methods has rapidly become one of the standard computational approaches in the literature of microarray gene expression data analysis, little attention has been paid to uncertainty in the results obtained. Results: We present an R/Bioconductor port of a fast novel algorithm for Bayesian agglomerative hierarchical clustering and demonstrate its use in clustering gene expression microarray data. The method performs bottom-up hierarchical clustering, using a Dirichlet Process (infinite mixture) to model uncertainty in the data and Bayesian model selection to decide at each step which clusters to merge. Conclusion: Biologically plausible results are presented from a well studied data set: expression profiles of A. thaliana subjected to a variety of biotic and abiotic stresses. Our method avoids several limitations of traditional methods, for example how many clusters there should be and how to choose a principled distance metric.

Dichotomous cellular properties of mouse orexin/hypocretin neurons

Cornelia Schone, Anne Venner, David A. Knowles, Mahesh M Karnani, Denis Burdakov, 2011. (The Journal of Physiology).

Abstract▼ URL

Hypothalamic hypocretin/orexin (hcrt/orx) neurons recently emerged as critical regulators of sleep-wake cycles, reward-seeking, and body energy balance. However, at the level of cellular and network properties, it remains unclear whether hcrt/orx neurons are one homogenous population, or whether there are several distinct types of hcrt/orx cells. Here, we collated diverse structural and functional information about individual hcrt/orx neurons in mouse brain slices, by combining patch-clamp analysis of spike firing, membrane currents, and synaptic inputs with confocal imaging of cell shape and subsequent 3-dimensional Sholl analysis of dendritic architecture. Statistical cluster analysis of intrinsic firing properties revealed that hcrt/orx neurons fall into two distinct types. These two cell types also differ in the complexity of their dendritic arbour, the strength of AMPA and GABAA receptor-mediated synaptic drive that they receive, and the density of low-threshold, 4-aminopyridine-sensitive, transient K+ current. Our results provide quantitative evidence that, at the cellular level, the mouse hcrt/orx system is composed of two classes of neurons with different firing properties, morphologies, and synaptic input organization.

Ranking Relations Using Analogies in Biological and Information Networks

R. Silva, K. A. Heller, Z. Ghahramani, E. M. Airoldi, 2010. (Annals of Applied Statistics).

Abstract▼ URL

Analogical reasoning depends fundamentally on the ability to learn and generalize about relations between objects. We develop an approach to relational learning which, given a set of pairs of objects S = A(1):B(1), A(2):B(2), …, A(N):B(N), measures how well other pairs A:B fit in with the set S. Our work addresses the question: is the relation between objects A and B analogous to those relations found in S? Such questions are particularly relevant in information retrieval, where an investigator might want to search for analogous pairs of objects that match the query set of interest. There are many ways in which objects can be related, making the task of measuring analogies very challenging. Our approach combines a similarity measure on function spaces with Bayesian analysis to produce a ranking. It requires data containing features of the objects of interest and a link matrix specifying which relationships exist; no further attributes of such relationships are necessary. We illustrate the potential of our method on text analysis and information networks. An application on discovering functional interactions between pairs of proteins is discussed in detail, where we show that our approach can work in practice even if a small set of protein pairs is provided.

Robust estimation of local genetic ancestry in admixed populations using a non-parametric Bayesian approach

Kyung-Ah Sohn, Zoubin Ghahramani, Eric P. Xing, 2012. (Genetics).

Abstract▼ URL

We present a new haplotype-based approach for inferring local genetic ancestry of individuals in an admixed population. Most existing approaches for local ancestry estimation ignore the latent genetic relatedness between ancestral populations and treat them as independent. In this paper, we exploit such information by building an inheritance model that describes both the ancestral populations and the admixed population jointly in a unified framework. Based on an assumption that the common hypothetical founder haplotypes give rise to both the ancestral and admixed population haplotypes, we employ an infinite hidden Markov model to characterize each ancestral population and further extend it to generate the admixed population. Through an effective utilization of the population structural information under a principled nonparametric Bayesian framework, the resulting model is significantly less sensitive to the choice and the amount of training data for ancestral populations than state-of-the-arts algorithms. We also improve the robustness under deviation from common modeling assumptions by incorporating population-specific scale parameters that allow variable recombination rates in different populations. Our method is applicable to an admixed population from an arbitrary number of ancestral populations and also performs competitively in terms of spurious ancestry proportions under general multi-way admixture assumption. We validate the proposed method by simulation under various admixing scenarios and present empirical analysis results on worldwide distributed dataset from Human Genome Diversity Project.

Comment: doi: 10.1534/genetics.112.140228

A robust Bayesian two-sample test for detecting intervals of differential gene expression in microarray time series

O. Stegle, K. J. Denby, E. J. Cooke, D. L. Wild, Z. Ghahramani, K. M. Borgwardt, 2010. (Journal of Computational Biology). DOI: 10.1089/cmb.2009.0175.

Abstract▼ URL

Understanding the regulatory mechanisms that are responsible for an organism’s response to environmental change is an important issue in molecular biology. A first and important step towards this goal is to detect genes whose expression levels are affected by altered external conditions. A range of methods to test for differential gene expression, both in static as well as in time-course experiments, have been proposed. While these tests answer the question whether a gene is differentially expressed, they do not explicitly address the question when a gene is differentially expressed, although this information may provide insights into the course and causal structure of regulatory programs. In this article, we propose a twosample test for identifying intervals of differential gene expression in microarray time series. Our approach is based on Gaussian process regression, can deal with arbitrary numbers of replicates, and is robust with respect to outliers. We apply our algorithm to study the response of Arabidopsis thaliana genes to an infection by a fungal pathogen using a microarray time series dataset covering 30,336 gene probes at 24 observed time points. In classification experiments, our test compares favorably with existing methods and provides additional insights into time-dependent differential expression.

Discovering temporal patterns of differential gene expression in microarray time series

O. Stegle, K. Denby, S. McHattie, A. Meade, D. Wild, Z. Ghahramani, K Borgwardt, September 2009. (In German Conference on Bioinformatics). Halle, Germany.

Abstract▼ URL

A wealth of time series of microarray measurements have become available over recent years. Several two-sample tests for detecting differential gene expression in these time series have been defined, but they can only answer the question whether a gene is differentially expressed across the whole time series, not in which intervals it is differentially expressed. In this article, we propose a Gaussian process based approach for studying these dynamics of differential gene expression. In experiments on Arabidopsis thaliana gene expression levels, our novel technique helps us to uncover that the family of WRKY transcription factors appears to be involved in the early response to infection by a fungal pathogen.

A robust Bayesian two-sample test for detecting intervals of differential gene expression in microarray time series

O. Stegle, K. Denby, David L. Wild, Zoubin Ghahramani, Karsten Borgwardt, 2009. (In 13th Annual International Conference on Research in Computational Molecular Biology (RECOMB 2009)). Tucson, AZ, USA. Springer-Verlag. Lecture Notes in Bioinformatics. DOI: 10.1007/978-3-642-02008-7_14. ISBN: 978-3-642-02007-0.

Abstract▼ URL

Understanding the regulatory mechanisms that are responsible for an organism’s response to environmental changes is an important question in molecular biology. A first and important step towards this goal is to detect genes whose expression levels are affected by altered external conditions. A range of methods to test for differential gene expression, both in static as well as in time-course experiments, have been proposed. While these tests answer the question whether a gene is differentially expressed, they do not explicitly address the question when a gene is differentially expressed, although this information may provide insights into the course and causal structure of regulatory programs. In this article, we propose a two-sample test for identifying intervals of differential gene expression in microarray time series. Our approach is based on Gaussian process regression, can deal with arbitrary numbers of replicates and is robust with respect to outliers. We apply our algorithm to study the response of Arabidopsis thaliana genes to an infection by a fungal pathogen using a microarray time series dataset covering 30,336 gene probes at 24 time points. In classification experiments our test compares favorably with existing methods and provides additional insights into time-dependent differential expression.

An Evaluation Framework for the Objective Functions of De Novo Drug Design Benchmarks

Austin Tripp, Wenlin Chen, José Miguel Hernández-Lobato, 2022. (In ICLR 2022 Workshop on Machine Learning for Drug Discovery).

Abstract▼ URL

De novo drug design has recently received increasing attention from the machine learning community. It is important that the field is aware of the actual goals and challenges of drug design and the roles that de novo molecule design algorithms could play in accelerating the process, so that algorithms can be evaluated in a way that reflects how they would be applied in real drug design scenarios. In this paper, we propose a framework for critically assessing the merits of benchmarks, and argue that most of the existing de novo drug design benchmark functions are either highly unrealistic or depend upon a surrogate model whose performance is not well characterized. In order for the field to achieve its long-term goals, we recommend that poor benchmarks (especially logP and QED) be deprecated in favour of better benchmarks. We hope that our proposed framework can play a part in developing new de novo drug design benchmarks that are more realistic and ideally incorporate the intrinsic goals of drug design.

Inferring the effectiveness of government interventions against COVID-19

Jan M Brauner, Sören Mindermann, Mrinank Sharma, David Johnston, John Salvatier, Tomáš Gavenčiak, Anna B Stephenson, Gavin Leech, George Altman, Vladimir Mikulik, Alexander John Norman, Joshua Teperowski Monrad, Tamay Besiroglu, Hong Ge, Meghan A Hartwick, Yee Whye Teh, Leonid Chindelevitch, Yarin Gal, Jan Kulveit, December 2020. (Science).

URL

Interoperability of statistical models in pandemic preparedness: principles and reality

George Nicholson, Marta Blangiardo, Mark Briers, Peter J Diggle, Tor Erlend Fjelde, Hong Ge, Robert J B Goudie, Radka Jersakova, Ruairidh E King, Brieuc C L Lehmann, Ann-Marie Mallon, Tullia Padellini, Yee Whye Teh, Chris Holmes, Sylvia Richardson, May 2022. (Stat. Sci.).

Abstract▼ URL

We present interoperability as a guiding framework for statistical modelling to assist policy makers asking multiple questions using diverse datasets in the face of an evolving pandemic response. Interoperability provides an important set of principles for future pandemic preparedness, through the joint design and deployment of adaptable systems of statistical models for disease surveillance using probabilistic reasoning. We illustrate this through case studies for inferring and characterising spatial-temporal prevalence and reproduction numbers of SARS-CoV-2 infections in England.