Reinforcement learning (RL) is a general computational approach to experience-based goal-directed learning for sequential decision making under uncertainty. However, RL often lacks efficiency in terms of the number of required trials when no task-specific knowledge is available. This lack of efficiency makes RL often inapplicable to (optimal) control problems. Thus, a central issue in RL is to speed up learning by extracting more information from available experience.

The contributions of this dissertation are threefold:

1. We propose PILCO, a fully Bayesian approach for efficient RL in continuous-valued state and action spaces when no expert knowledge is available. PILCO is based on well-established ideas from statistics and machine learning. PILCO's key ingredient is a probabilistic dynamics model learned from data, which is implemented by a Gaussian process (GP). The GP carefully quantifies knowledge by a probability distribution over plausible dynamics models. By averaging over all these models during long-term planning and decision making, PILCO takes uncertainties into account in a principled way and, therefore, reduces model bias, a central problem in model-based RL.

2. Due to its generality and efficiency, PILCO can be considered a conceptual and practical approach to jointly learning models and controllers when expert knowledge is difficult to obtain or simply not available. For this scenario, we investigate PILCO's properties its applicability to challenging real and simulated nonlinear control problems. For example, we consider the tasks of learning to swing up a double pendulum attached to a cart or to balance a unicycle with five degrees of freedom. Across all tasks we report unprecedented automation and an unprecedented learning efficiency for solving these tasks.

3. As a step toward pilco's extension to partially observable Markov decision processes, we propose a principled algorithm for robust filtering and smoothing in GP dynamic systems. Unlike commonly used Gaussian filters for nonlinear systems, it does neither rely on function linearization nor on finite-sample representations of densities. Our algorithm profits from exact moment matching for predictions while keeping all computations analytically tractable. We present experimental evidence that demonstrates the robustness and the advantages of our method over unscented Kalman filters, the cubature Kalman filter, and the extended Kalman filter.} } @article{DeiFoxRas15, cat = {gp rl}, author = {Marc Peter Deisenroth and Dieter Fox and Carl Edward Rasmussen}, title = {Gaussian Processes for Data-Efficient Learning in Robotics and Control}, journal = PAMI, pages = {408--423}, volume = 37, issue = 2, year = 2015, abstract = {Autonomous learning has been a promising direction in control and robotics for more than a decade since data-driven learning allows to reduce the amount of engineering knowledge, which is otherwise required. However, autonomous reinforcement learning (RL) approaches typically require many interactions with the system to learn controllers, which is a practical limitation in real systems, such as robots, where many interactions can be impractical and time consuming. To address this problem, current learning approaches typically require task-speciﬁc knowledge in form of expert demonstrations, realistic simulators, pre-shaped policies, or speciﬁc knowledge about the underlying dynamics. In this article, we follow a different approach and speed up learning by extracting more information from data. In particular, we learn a probabilistic, non-parametric Gaussian process transition model of the system. By explicitly incorporating model uncertainty into long-term planning and controller learning our approach reduces the effects of model errors, a key problem in model-based learning. Compared to state-of-the art RL our model-based policy search method achieves an unprecedented speed of learning. We demonstrate its applicability to autonomous learning in real robot and control tasks.}, doi = {10.1109/TPAMI.2013.218} } @inproceedings{DeiHubHan09, cat = {gp time}, author = {Marc Peter Deisenroth and Marco F.~Huber and Uwe D.~Hanebeck}, title = {Analytic Moment-based {G}aussian Process Filtering}, booktitle = icml26, year = 2009, editor = {L\'{e}on Bottou and Michael Littman}, pages = {225--232}, address = {Montr\'{e}al, QC, Canada}, month = {June}, publisher = {Omnipress}, url = {.}, abstract = {We propose an analytic moment-based filter for nonlinear stochastic dynamic systems modeled by Gaussian processes. Exact expressions for the expected value and the covariance matrix are provided for both the prediction step and the filter step, where an additional Gaussian assumption is exploited in the latter case. Our filter does not require further approximations. In particular, it avoids finite-sample approximations. We compare the filter to a variety of Gaussian filters, that is, the EKF, the UKF, and the recent GP-UKF proposed by Ko et al. (2007).}, annote = {With corrections. code.} } @inproceedings{DeiPetRas08, cat = {gp approx}, author = {Marc Peter Deisenroth and Jan Peters and Carl Edward Rasmussen}, title = {Approximate Dynamic Programming with {G}aussian Processes}, booktitle = {2008 American Control Conference (ACC 2008)}, year = 2008, pages = {4480--4485}, address = {Seattle, WA, USA}, month = {June}, url = {.}, abstract = {In general, it is difficult to determine an optimal closed-loop policy in nonlinear control problems with continuous-valued state and control domains. Hence, approximations are often inevitable. The standard method of discretizing states and controls suffers from the curse of dimensionality and strongly depends on the chosen temporal sampling rate. The paper introduces Gaussian Process Dynamic Programming (GPDP). In GPDP, value functions in the Bellman recursion of the dynamic programming algorithm are modeled using Gaussian processes. GPDP returns an optimal state-feedback for a finite set of states. Based on these outcomes, we learn a possibly discontinuous closed-loop policy on the entire state space by switching between two independently trained Gaussian processes.}, annote = {code.} } @inproceedings{DeiRas09, cat = {rl}, author = {Marc Peter Deisenroth and Carl Edward Rasmussen}, title = {Bayesian Inference for Efficient Learning in Control}, booktitle = {Multidisciplinary Symposium on Reinforcement Learning}, year = 2009, address = {Montr\'{e}al, QC, Canada}, month = {June}, url = {.}, abstract = {In contrast to humans or animals, artificial learners often require more trials when learning motor control tasks solely based on experience. Efficient autonomous learners will reduce the amount of engineering required to solve control problems. By using probabilistic forward models, we can employ two key ingredients of biological learning systems to speed up artificial learning. We present a consistent and coherent Bayesian framework that allows for efficient autonomous experience-based learning. We demonstrate the success of our learning algorithm by applying it to challenging nonlinear control problems in simulation and in hardware.} } @inproceedings{DeiRas09b, cat = {rl}, author = {Marc Peter Deisenroth and Carl Edward Rasmussen}, title = {Efficient Reinforcement Learning for Motor Control}, booktitle = {10th International PhD Workshop on Systems and Control}, year = 2009, address = {Hlubok\'{a} nad Vltavou, Czech Republic}, month = {September}, url = {.}, abstract = {Artificial learners often require many more trials than humans or animals when learning motor control tasks in the absence of expert knowledge. We implement two key ingredients of biological learning systems, generalization and incorporation of uncertainty into the decision-making process, to speed up artificial learning. We present a coherent and fully Bayesian framework that allows for efficient artificial learning in the absence of expert knowledge. The success of our learning framework is demonstrated on challenging nonlinear control problems in simulation and in hardware.} } @inproceedings{DeiRas11, cat = {gp rl}, author = {Marc Peter Deisenroth and Carl Edward Rasmussen}, title = {{PILCO}: {A} Model-Based and Data-Efficient Approach to Policy Search}, booktitle = icml28, year = 2011, abstract = {In this paper, we introduce PILCO, a practical, data-efficient model-based policy search method. PILCO reduces model bias, one of the key problems of model-based reinforcement learning, in a principled way. By learning a probabilistic dynamics model and explicitly incorporating model uncertainty into long-term planning, PILCO can cope with very little data and facilitates learning from scratch in only a few trials. Policy evaluation is performed in closed form using state-of-the-art approximate inference. Furthermore, policy gradients are computed analytically for policy improvement. We report unprecedented learning efficiency on challenging and high-dimensional control tasks.}, url = {.}, annote = {web site} } @inproceedings{DeiRasFox11, cat = {rl}, author = {Marc Peter Deisenroth and Carl Edward Rasmussen and Dieter Fox}, title = {Learning to Control a Low-Cost Manipulator using Data-Efficient Reinforcement Learning}, booktitle = rss9, year = 2011, month = {June}, address = {Los Angeles, CA, USA}, abstract = {Over the last years, there has been substantial progress in robust manipulation in unstructured environments. The long-term goal of our work is to get away from precise, but very expensive robotic systems and to develop affordable, potentially imprecise, self-adaptive manipulator systems that can interactively perform tasks such as playing with children. In this paper, we demonstrate how a low-cost off-the-shelf robotic system can learn closed-loop policies for a stacking task in only a handful of trials - from scratch. Our manipulator is inaccurate and provides no pose feedback. For learning a controller in the work space of a Kinect-style depth camera, we use a model-based reinforcement learning technique. Our learning method is data efficient, reduces model bias, and deals with several noise sources in a principled way during long-term planning. We present a way of incorporating state-space constraints into the learning process and analyze the learning gain by exploiting the sequential structure of the stacking task.}, url = {.}, annote = {project site} } @inproceedings{DeiRasPet08, cat = {rl gp}, author = {Marc Peter Deisenroth and Carl Edward Rasmussen and Jan Peters}, title = {Model-Based Reinforcement Learning with Continuous States and Actions}, booktitle = {Proceedings of the 16th European Symposium on Artificial Neural Networks (ESANN 2008)}, year = 2008, pages = {19--24}, address = {Bruges, Belgium}, month = {April}, url = {.}, abstract = {Finding an optimal policy in a reinforcement learning (RL) framework with continuous state and action spaces is challenging. Approximate solutions are often inevitable. GPDP is an approximate dynamic programming algorithm based on Gaussian process (GP) models for the value functions. In this paper, we extend GPDP to the case of unknown transition dynamics. After building a GP model for the transition dynamics, we apply GPDP to this model and determine a continuous-valued policy in the entire state space. We apply the resulting controller to the underpowered pendulum swing up. Moreover, we compare our results on this RL task to a nearly optimal discrete DP solution in a fully known environment.}, annote = {code. slides} } @article{DeiRasPet09, cat = {rl gp}, volume = 72, number = {7--9}, month = {March}, author = {Marc Peter Deisenroth and Carl Edward Rasmussen and Jan Peters}, title = {Gaussian process dynamic programming}, publisher = {Elsevier B. V.}, year = 2009, journal = {Neurocomputing}, pages = {1508--1524}, url = {.}, abstract = {Reinforcement learning (RL) and optimal control of systems with continuous states and actions require approximation techniques in most interesting cases. In this article, we introduce Gaussian process dynamic programming (GPDP), an approximate value function-based RL algorithm. We consider both a classic optimal control problem, where problem-specific prior knowledge is available, and a classic RL problem, where only very general priors can be used. For the classic optimal control problem, GPDP models the unknown value functions with Gaussian processes and generalizes dynamic programming to continuous-valued states and actions. For the RL problem, GPDP starts from a given initial state and explores the state space using Bayesian active learning. To design a fast learner, available data have to be used efficiently. Hence, we propose to learn probabilistic models of the a priori unknown transition dynamics and the value functions on the fly. In both cases, we successfully apply the resulting continuous-valued controllers to the under-actuated pendulum swing up and analyze the performances of the suggested algorithms. It turns out that GPDP uses data very efficiently and can be applied to problems, where classic dynamic programming would be cumbersome.}, annote = {code.}, doi = {10.1016/j.neucom.2008.12.019} } @article{DeiTurHubetal12, cat = {gp time}, author = {Marc Peter Deisenroth and Ryan D.~Turner and Marco F.~Huber and Uwe D.~Hanebeck and Carl Edward Rasmussen}, title = {Robust Filtering and Smoothing with {G}aussian Processes}, journal = IETAC, volume = 57, year = 2012, number = 7, pages = {1865--1871}, abstract = {We propose a principled algorithm for robust Bayesian filtering and smoothing in nonlinear stochastic dynamic systems when both the transition function and the measurement function are described by nonparametric Gaussian process (GP) models. GPs are gaining increasing importance in signal processing, machine learning, robotics, and control for representing unknown system functions by posterior probability distributions. This modern way of "system identification" is more robust than finding point estimates of a parametric function representation. Our principled filtering/smoothing approach for GP dynamic systems is based on analytic moment matching in the context of the forward-backward algorithm. Our numerical evaluations demonstrate the robustness of the proposed approach in situations where other state-of-the-art Gaussian filters and smoothers can fail.}, url = {.}, doi = {10.1109/TAC.2011.2179426} } @mastersthesis{Dos09, cat = {np}, author = {Finale Doshi-Velez}, title = {The {I}ndian Buffet Process: {S}calable Inference and Extensions}, school = {University of Cambridge}, year = 2009, address = {Cambridge, UK}, month = {August}, url = {.}, abstract = {Many unsupervised learning problems seek to identify hidden features from observations. In many real-world situations, the number of hidden features is unknown. To avoid specifying the number of hidden features a priori, one can use the Indian Buffet Process (IBP): a nonparametric latent feature model that does not bound the number of active features in a dataset. While elegant, the lack of efficient inference procedures for the IBP has prevented its application in large-scale problems. The core contribution of this thesis are three new inference procedures that allow inference in the IBP to be scaled from a few hundred to 100,000 observations. This thesis contains three parts: (1) An introduction to the IBP and a review of inference techniques and extensions. The first chapters summarise three constructions for the IBP and review all currently published inference techniques. Appendix C reviews extensions of the IBP to date. (2) Novel techniques for scalable Bayesian inference. This thesis presents three new inference procedures: (a) an accelerated Gibbs sampler for efficient Bayesian inference in a broad class of conjugate models, (b) a parallel, asynchronous Gibbs sampler that allows the accelerated Gibbs sampler to be distributed across multiple processors, and (c) a variational inference procedure for the IBP. (3) A framework for structured nonparametric latent feature models. We also present extensions to the IBP to model more sophisticated relationships between the co-occurring hidden features, providing a general framework for correlated non-parametric feature models.} } @inproceedings{Dos09b, author = {Finale Doshi-Velez}, title = {The Infinite Partially Observable {M}arkov Decision Process}, booktitle = nips23, year = 2009, address = {Cambridge, MA, USA}, month = {December}, publisher = mit, url = {.}, abstract = {The Partially Observable Markov Decision Process (POMDP) framework has proven useful in planning domains where agents must balance actions that provide knowledge and actions that provide reward. Unfortunately, most POMDPs are complex structures with a large number of parameters. In many real-world problems, both the structure and the parameters are difficult to specify from domain knowledge alone. Recent work in Bayesian reinforcement learning has made headway in learning POMDP models; however, this work has largely focused on learning the parameters of the POMDP model. We define an infinite POMDP (iPOMDP) model that does not require knowledge of the size of the state space; instead, it assumes that the number of visited states will grow as the agent explores its world and only models visited states explicitly. We demonstrate the iPOMDP on several standard problems.} } @inproceedings{DosGha09, cat = {np mcmc}, booktitle = icml26, title = {Accelerated {Gibbs} sampling for the {Indian} buffet process}, author = {Finale Doshi-Velez and Zoubin Ghahramani}, year = 2009, address = {Montr\'{e}al, QC, Canada}, month = {June}, publisher = {Omnipress}, pages = {273--280}, editor = {L\'{e}on Bottou and Michael Littman}, url = {.}, abstract = {We often seek to identify co-occurring hidden features in a set of observations. The Indian Buffet Process (IBP) provides a non-parametric prior on the features present in each observation, but current inference techniques for the IBP often scale poorly. The collapsed Gibbs sampler for the IBP has a running time cubic in the number of observations, and the uncollapsed Gibbs sampler, while linear, is often slow to mix. We present a new linear-time collapsed Gibbs sampler for conjugate likelihood models and demonstrate its efficacy on large real-world datasets.} } @inproceedings{DosGha09a, author = {Doshi-Velez, Finale and Ghahramani, Zoubin}, booktitle = {ICML}, cat = {np, mcmc}, editor = {Danyluk, Andrea Pohoreckyj and Bottou, L{\'e}on and Littman, Michael L.}, isbn = {978-1-60558-516-1}, pages = 35, publisher = {acm}, series = {ACM International Conference Proceeding Series}, title = {Accelerated sampling for the {I}ndian Buffet Process}, url = {.}, volume = 382, year = 2009, abstract = {We often seek to identify co-occurring hidden features in a set of observations. The Indian Buffet Process (IBP) provides a nonparametric prior on the features present in each observation, but current inference techniques for the IBP often scale poorly. The collapsed Gibbs sampler for the IBP has a running time cubic in the number of observations, and the uncollapsed Gibbs sampler, while linear, is often slow to mix. We present a new linear-time collapsed Gibbs sampler for conjugate likelihood models and demonstrate its efficacy on large real-world datasets.} } @inproceedings{DosGha09b, cat = {np}, booktitle = {Conference on Uncertainty in Artificial Intelligence (UAI 2009)}, title = {Correlated non-parametric latent feature models}, author = {F. Doshi-Velez and Z. Ghahramani}, year = 2009, address = {Montr\'{e}al, QC, Canada}, month = {June}, pages = {143--150}, publisher = {AUAI Press}, url = {.}, abstract = {We are often interested in explaining data through a set of hidden factors or features. To allow for an unknown number of such hidden features, one can use the IBP: a non-parametric latent feature model that does not bound the number of active features in a dataset. However, the IBP assumes that all latent features are uncorrelated, making it inadequate for many real-world problems. We introduce a framework for correlated non-parametric feature models, generalising the IBP. We use this framework to generate several specific models and demonstrate applications on real-world datasets.} } @inproceedings{DosGha11, cat = {rl}, author = {Finale Doshi-Velez and Zoubin Ghahramani}, year = 2011, title = {A Comparison of Human and Agent Reinforcement Learning in Partially Observable Domains}, booktitle = {33rd Annual Meeting of the Cognitive Science Society}, address = {Boston, MA}, url = {.}, abstract = {It is commonly stated that reinforcement learning (RL) algorithms learn slower than humans. In this work, we investigate this claim using two standard problems from the RL literature. We compare the performance of human subjects to RL techniques. We find that context---the meaningfulness of the observations—--plays a significant role in the rate of human RL. Moreover, without contextual information, humans often fare much worse than classic algorithms. Comparing the detailed responses of humans and RL algorithms, we also find that humans appear to employ rather different strategies from standard algorithms, even in cases where they had indistinguishable performance to them. Our research both sheds light on human RL and provides insights for improving RL algorithms.} } @inproceedings{DosKnoMohGha09, cat = {np approx}, url = {.}, author = {Finale Doshi-Velez and David Knowles and Shakir Mohamed and Zoubin Ghahramani}, title = {Large Scale Non-parametric Inference: {D}ata Parallelisation in the {I}ndian Buffet Process}, booktitle = nips23, pages = {1294--1302}, address = {Cambridge, MA, USA}, publisher = mit, year = 2009, month = {December}, abstract = {Nonparametric Bayesian models provide a framework for flexible probabilistic modelling of complex datasets. Unfortunately, the high-dimensional averages required for Bayesian methods can be slow, especially with the unbounded representations used by nonparametric models. We address the challenge of scaling Bayesian inference to the increasingly large datasets found in real-world applications. We focus on parallelisation of inference in the Indian Buffet Process (IBP), which allows data points to have an unbounded number of sparse latent features. Our novel MCMC sampler divides a large data set between multiple processors and uses message passing to compute the global likelihoods and posteriors. This algorithm, the first parallel inference scheme for IBP-based models, scales to datasets orders of magnitude larger than have previously been possible.} } @inproceedings{DosMilVanTeh09, cat = {np}, author = {Doshi-Velez, F. and Miller, K.T. and {Van Gael}, J. and Teh, Y.W.}, booktitle = aistats12, keywords = {Factor Analysis,IBP,Variational}, mendeley-tags ={Factor Analysis,IBP,Variational}, pages = {137--144}, publisher = jmlr, address = {Clearwater Beach, FL, USA}, title = {Variational inference for the {I}ndian buffet process}, url = {.}, volume = 12, year = 2009, month = {April}, abstract = {The Indian Buffet Process (IBP) is a nonparametric prior for latent feature models in which observations are influenced by a combination of hidden features. For example, images may be composed of several objects and sounds may consist of several notes. Latent feature models seek to infer these unobserved features from a set of observations; the IBP provides a principled prior in situations where the number of hidden features is unknown. Current inference methods for the IBP have all relied on sampling. While these methods are guaranteed to be accurate in the limit, samplers for the IBP tend to mix slowly in practice. We develop a deterministic variational method for inference in the IBP based on a truncated stick-breaking approximation, provide theoretical bounds on the truncation error, and evaluate our method in several data regimes.} } @techreport{DosMilVanTeh09b, cat = {np}, author = {Finale Doshi-Velez and Kurt T. Miller and Jurgen {Van Gael} and Yee Whye Teh}, title = {Variational Inference for the {I}ndian Buffet Process}, institution = {University of Cambridge}, year = 2009, number = {CBL-2009-001}, address = {Computational and Biological Learning Laboratory, Department of Engineering}, month = {April}, url = {.}, abstract = {The Indian Buffet Process (IBP) is a nonparametric prior for latent feature models in which observations are influenced by a combination of hidden features. For example, images may be composed of several objects and sounds may consist of several notes. Latent feature models seek to infer these unobserved features from a set of observations; the IBP provides a principled prior in situations where the number of hidden features is unknown. Current inference methods for the IBP have all relied on sampling. While these methods are guaranteed to be accurate in the limit, samplers for the IBP tend to mix slowly in practice. We develop a deterministic variational method for inference in the IBP based on truncating to infinite models, provide theoretical bounds on the truncation error, and evaluate our method in several data regimes. This technical report is a longer version of Doshi-Velez et al. (2009).} } @article{DosRoy08, author = {Finale Doshi and Nicholas Roy}, title = {Spoken Language Interaction with Model Uncertainty: {A}n Adaptive Human-Robot Interaction System}, journal = {Connection Science}, year = 2008, volume = 20, pages = {290--318}, number = 4, month = {December}, url = {.}, abstract = {Spoken language is one of the most intuitive forms of interaction between humans and agents. Unfortunately, agents that interact with people using natural language often experience communication errors and do not correctly understand the user's intentions. Recent systems have successfully used probabilistic models of speech, language, and user behavior to generate robust dialog performance in the presence of noisy speech recognition and ambiguous language choices, but decisions made using these probabilistic models are still prone to errors due to the complexity of acquiring and maintaining a complete model of human language and behavior. In this paper, we describe a decision-theoretic model for human-robot interaction using natural language. Our algorithm is based on the Partially Observable Markov Decision Process (POMDP), which allows agents to choose actions that are robust not only to uncertainty from noisy or ambiguous speech recognition but also unknown user models. Like most dialog systems, a POMDP is defined by a large number of parameters that may be difficult to specify a priori from domain knowledge, and learning these parameters from the user may require an unacceptably long training period. We describe an extension to the POMDP model that allows the agent to acquire a linguistic model of the user online, including new vocabulary and word choice preferences. Our approach not only avoids a training period of constant questioning as the agent learns, but also allows the agent to actively query for additional information when its uncertainty suggests a high risk of mistakes. We demonstrate our approach both in simulation and on a natural language interaction system for a robotic wheelchair application.} } @inproceedings{DubHwaRan04a, author = {Dubey, Ananya and Hwang, Seungwoo and Rangel, Claudia and Rasmussen, Carl Edward and Ghahramani, Zoubin and Wild, David L.}, booktitle = {Pacific Symposium on Biocomputing}, cat = {bioinf}, editor = {Altman, Russ B. and Dunker, A. Keith and Hunter, Lawrence and Jung, Tiffany A. and Klein, Teri E.}, isbn = {981-238-598-3}, pages = {399-410}, publisher = {World Scientific}, title = {Clustering Protein Sequence and Structure Space with Infinite {G}aussian Mixture Models}, url = {.}, year = 2004, abstract = {We describe a novel approach to the problem of automatically clustering protein sequences and discovering protein families, subfamilies etc., based on the theory of infinite Gaussian mixtures models. This method allows the data itself to dictate how many mixture components are required to model it, and provides a measure of the probability that two proteins belong to the same cluster. We illustrate our methods with application to three data sets: globin sequences, globin sequences with known three-dimensional structures and G-protein coupled receptor sequences. The consistency of the clusters indicate that our method is producing biologically meaningful results, which provide a very good indication of the underlying families and subfamilies. With the inclusion of secondary structure and residue solvent accessibility information, we obtain a classification of sequences of known structure which both reflects and extends their SCOP classifications. A supplementray web site containing larger versions of the figures is available at http://public.kgi.edu/approximately wid/PSB04/index.html } } @inproceedings{DubHwaRanetal04, cat = {np bioinf clust}, author = {A.~Dubey and S.~Hwang and C.~Rangel and Carl Edward Rasmussen and Zoubin Ghahramani and David L.~Wild}, title = {Clustering Protein Sequence and Structure Space with Infinite {G}aussian Mixture Models}, year = 2004, publisher = {World Scientific Publishing}, pages = {399--410}, journal = {Pacific Symposium on Biocomputing 2004; Vol. 9}, address = {Singapore}, abstract = {We describe a novel approach to the problem of automatically clustering protein sequences and discovering protein families, subfamilies etc., based on the thoery of infinite Gaussian mixture models. This method allows the data itself to dictate how many mixture components are required to model it, and provides a measure of the probability that two proteins belong to the same cluster. We illustrate our methods with application to three data sets: globin sequences, globin sequences with known tree-dimensional structures and G-pretein coupled receptor sequences. The consistency of the clusters indicate that that our methods is producing biologically meaningful results, which provide a very good indication of the underlying families and subfamilies. With the inclusion of secondary structure and residue solvent accessibility information, we obtain a classification of sequences of known structure which reflects and extends their SCOP classifications.}, url = {.}, booktitle = {Pacific Symposium on Biocomputing 2004}, location = {The Big Island of Hawaii} } @inproceedings{DuvLloGroetal13, cat = {gp np time sigproc}, title = {Structure Discovery in Nonparametric Regression through Compositional Kernel Search}, author = {David Duvenaud and James Robert Lloyd and Roger Grosse and Joshua B. Tenenbaum and Zoubin Ghahramani}, booktitle = icml30, year = 2013, month = {June}, address = {Atlanta, Georgia, USA}, url = {http://jmlr.org/proceedings/papers/v28/duvenaud13.pdf}, abstract = {Despite its importance, choosing the structural form of the kernel in nonparametric regression remains a black art. We define a space of kernel structures which are built compositionally by adding and multiplying a small number of base kernels. We present a method for searching over this space of structures which mirrors the scientific discovery process. The learned structures can often decompose functions into interpretable components and enable long-range extrapolation on time-series datasets. Our structure search method outperforms many widely used kernels and kernel combination methods on a variety of prediction tasks.} } @inproceedings{DuvNicRas11, cat = {gp}, booktitle = nips24, title = {Additive {G}aussian Processes}, author = {David Duvenaud and Hannes Nickisch and Carl Edward Rasmussen}, year = 2011, address = {Granada, Spain}, pages = {226--234}, url = {http://papers.nips.cc/paper/4221-additive-gaussian-processes}, abstract = {We introduce a Gaussian process model of functions which are additive. An additive function is one which decomposes into a sum of low-dimensional functions, each depending on only a subset of the input variables. Additive GPs generalize both Generalized Additive Models, and the standard GP models which use squared-exponential kernels. Hyperparameter learning in this model can be seen as Bayesian Hierarchical Kernel Learning (HKL). We introduce an expressive but tractable parameterization of the kernel function, which allows efficient evaluation of all input interaction terms, whose number is exponential in the input dimension. The additional structure discoverable by this model results in increased interpretability, as well as state-of-the-art predictive power in regression tasks.} } @inproceedings{DuvRipAdaGha14, cat = {gp np deep}, title = {Avoiding pathologies in very deep networks}, author = {David Duvenaud and Oren Rippel and Ryan P. Adams and Zoubin Ghahramani}, booktitle = aistats17, year = 2014, month = {April}, address = {Reykjavik, Iceland}, url = {http://arxiv.org/pdf/1402.5836.pdf}, abstract = {Choosing appropriate architectures and regularization strategies for deep networks is crucial to good predictive performance. To shed light on this problem, we analyze the analogous problem of constructing useful priors on compositions of functions. Specifically, we study the deep Gaussian process, a type of infinitely-wide, deep neural network. We show that in standard architectures, the representational capacity of the network tends to capture fewer degrees of freedom as the number of layers increases, retaining only a single degree of freedom in the limit. We propose an alternate network architecture which does not suffer from this pathology. We also examine deep covariance functions, obtained by composing infinitely many feature transforms. Lastly, we characterize the class of models obtained by performing dropout on Gaussian processes.} } @inproceedings{DziRoyGha15, cat = {deep}, title = {Training generative neural networks via {M}aximum {M}ean {D}iscrepancy optimization}, booktitle = uai31, year = 2015, month = {July}, address = {Amsterdam, The Netherlands}, author = {Gintare Karolina Dziugaite and Daniel M. Roy and Zoubin Ghahramani}, pages = {258--267}, abstract = {We consider training a deep neural network to generate samples from an unknown distribution given i.i.d. data. We frame learning as an optimization minimizing a two-sample test statistic---informally speaking, a good generator network produces samples that cause a two-sample test to fail to reject the null hypothesis. As our two-sample test statistic, we use an unbiased estimate of the maximum mean discrepancy, which is the centerpiece of the nonparametric kernel two-sample test proposed by Gretton et al. (2012). We compare to the adversarial nets framework introduced by Goodfellow et al. (2014), in which learning is a two-player game between a generator network and an adversarial discriminator network, both trained to outwit the other. From this perspective, the MMD statistic plays the role of the discriminator. In addition to empirical comparisons, we prove bounds on the generalization error incurred by optimizing the empirical MMD.}, url = {http://auai.org/uai2015/proceedings/papers/230.pdf}, } @inproceedings{EatGha09, cat = {gm}, author = {Frederik Eaton and Zoubin Ghahramani}, title = {Choosing a Variable to Clamp: {A}pproximate Inference Using Conditioned Belief Propagation}, booktitle = aistats12, pages = {145--152}, year = 2009, editor = {D. van Dyk and M. Welling}, volume = 5, address = {Clearwater Beach, FL, USA}, month = {April}, publisher = jmlr, url = {.}, annote = {Code (in C++ based on libDAI).}, abstract = {In this paper we propose an algorithm for approximate inference on graphical models based on belief propagation (BP). Our algorithm is an approximate version of Cutset Conditioning, in which a subset of variables is instantiated to make the rest of the graph singly connected. We relax the constraint of single-connectedness, and select variables one at a time for conditioning, running belief propagation after each selection. We consider the problem of determining the best variable to clamp at each level of recursion, and propose a fast heuristic which applies back-propagation to the BP updates. We demonstrate that the heuristic performs better than selecting variables at random, and give experimental results which show that it performs competitively with existing approximate inference algorithms.} } @article{EatGha13a, author = {Eaton, Frederik and Ghahramani, Zoubin}, cat = {gm}, journal = {Neural Computation}, number = 5, pages = {1213-1260}, title = {Model Reductions for Inference: Generality of Pairwise, Binary, and Planar Factor Graphs}, url = {.}, volume = 25, year = 2013, abstract = {We offer a solution to the problem of efficiently translating algorithms between different types of discrete statistical model. We investigate the expressive power of three classes of model-those with binary variables, with pairwise factors, and with planar topology-as well as their four intersections. We formalize a notion of "simple reduction" for the problem of inferring marginal probabilities and consider whether it is possible to "simply reduce" marginal inference from general discrete factor graphs to factor graphs in each of these seven subclasses. We characterize the reducibility of each class, showing in particular that the class of binary pairwise factor graphs is able to simply reduce only positive models. We also exhibit a continuous "spectral reduction" based on polynomial interpolation, which overcomes this limitation. Experiments assess the performance of standard approximate inference algorithms on the outputs of our reductions. } } @inproceedings{EicTolZieetal04, author = {Jan Eichhorn and Andreas S.~Tolias and Alexander Zien and Malte Ku{\ss} and Carl Edward Rasmussen and Jason Weston and Nikos K.~Logothetis and Bernhard Sch{\"o}lkopf}, title = {Prediction on Spike Data Using Kernel Algorithms}, year = 2004, volume = 16, publisher = mit, pages = {1367--1374}, editor = {Sebastian Thrun and Lawrence K.~Saul and Bernhard Sch{\"o}lkopf}, address = {Cambridge, MA, USA}, abstract = {We report and compare the performance of different learning algorithms based on data from cortical recordings. The task is to predict the orientation of visual stimuli from the activity of a population of simultaneously recorded neurons. We compare several ways of improving the coding of the input (i.e., the spike data) as well as of the output (i.e., the orientation), and report the results obtained using different kernel algorithms.}, booktitle = nips16, location = {Vancouver, BC, Canada}, URL = {.} } @inproceedings{FraKwoRasSch04, cat = {ssl}, author = {Matthias O.~Franz and Younghee Kwon and Carl Edward Rasmussen and Bernhard Sch{\"o}lkopf}, title = {Semi-supervised kernel regression using whitened function classes}, year = 2004, volume = 3175, booktitle = lncs, publisher = {Springer}, pages = {18--26}, journal = {Pattern Recognition, Proceedings of the 26th {DAGM} Symposium}, editor = {C.~E.~Rasmussen and H.~H.~B{\"u}lthoff and M.~A.~Giese and B.~Sch{\"o}lkopf}, address = {Berlin, Germany}, abstract = {The use of non-orthonormal basis functions in ridge regression leads to an often undesired non-isotropic prior in function space. In this study, we investigate an alternative regularization technique that results in an implicit whitening of the basis functions by penalizing directions in function space with a large prior variance. The regularization term is computed from unlabelled input data that characterizes the input distribution. Tests on two datasets using polynomial basis functions showed an improved average performance compared to standard ridge regression.}, url = {.} } @inproceedings{FreFerHam14, cat = {bioinf mcmc gm}, author = {Jes Frellsen and Thomas Hamelryck and Jesper Ferkinghoff-Borg}, title = {Combining the multicanonical ensemble with generative probabilistic models of local biomolecular structure}, year = 2014, booktitle = {Proceedings of the 59th {W}orld {S}tatistics {C}ongress of the {I}nternational {S}tatistical {I}nstitute}, address = {Hong Kong}, pages = {139--144}, abstract = {Markov chain Monte Carlo is a powerful tool for sampling complex systems such as large biomolecular structures. However, the standard Metropolis-Hastings algorithm suffers from a number of deficiencies when applied to systems with rugged free-energy landscapes. Some of these deficiencies can be addressed with the multicanonical ensemble. In this paper we will present two strategies for applying the multicanonical ensemble to distributions constructed from generative probabilistic models of local biomolecular structure. In particular, we will describe how to use the multicanonical ensemble efficiently in conjunction with the reference ratio method.}, url = {http://2013.isiproceedings.org/Files/IPS015-P4-S.pdf} } @inproceedings{FreWinGhaFer16, cat = {mcmc approx}, title = {Bayesian generalised ensemble {M}arkov chain {M}onte {C}arlo}, booktitle = aistats19, year = 2016, month = {May}, address = {Cadiz, Spain}, author = {Jes Frellsen and Ole Winther and Zoubin Ghahramani and Jesper Ferkinghoff-Borg}, abstract = {Bayesian generalised ensemble (BayesGE) is a new method that addresses two major drawbacks of standard Markov chain Monte Carlo algorithms for inference in high-dimensional probability models: inapplicability to estimate the partition function, and poor mixing properties. BayesGE uses a Bayesian approach to iteratively update the belief about the density of states (distribution of the log likelihood under the prior) for the model, with the dual purpose of enhancing the sampling efficiency and make the estimation of the partition function tractable. We benchmark BayesGE on Ising and Potts systems and show that it compares favourably to existing state-of-the-art methods.} } @phdthesis{Fri15, author = {Roger Frigola}, cat = {gp time approx}, title = {Bayesian Time Series Learning with {G}aussian Processes}, school = {University of Cambridge, Department of Engineering}, year = 2015, abstract = {The analysis of time series data is important in fields as disparate as the social sciences, biology, engineering or econometrics. In this dissertation, we present a number of algorithms designed to learn Bayesian nonparametric models of time series. The goal of these kinds of models is twofold. First, they aim at making predictions which quantify the uncertainty due to limitations in the quantity and the quality of the data. Second, they are flexible enough to model highly complex data whilst preventing overfitting when the data does not warrant complex models.

We begin with a unifying literature review on time series models based on Gaussian processes. Then, we centre our attention on the Gaussian Process State-Space Model (GP-SSM): a Bayesian nonparametric generalisation of discrete-time nonlinear state-space models. We present a novel formulation of the GP-SSM that offers new insights into its properties. We then proceed to exploit those insights by developing new learning algorithms for the GP-SSM based on particle Markov chain Monte Carlo and variational inference.

Finally, we present a filtered nonlinear auto-regressive model with a simple, robust and fast learning algorithm that makes it well suited to its application by non-experts on large datasets. Its main advantage is that it avoids the computationally expensive (and potentially difficult to tune) smoothing step that is a key part of learning nonlinear state-space models.}, address = {Cambridge, UK}, url = {.} } @inproceedings{FriCheRas14, cat = {gp np time}, title = {Variational {G}aussian Process State-Space Models}, author = {Roger Frigola and Yutian Chen and Carl E. Rasmussen}, booktitle = nips27, editor = {Z. Ghahramani and M. Welling and C. Cortes and N.D. Lawrence and K.Q. Weinberger}, year = 2014, abstract = {State-space models have been successfully used for more than fifty years in different areas of science and engineering. We present a procedure for efficient variational Bayesian learning of nonlinear state-space models based on sparse Gaussian processes. The result of learning is a tractable posterior over nonlinear dynamical systems. In comparison to conventional parametric models, we offer the possibility to straightforwardly trade off model capacity and computational cost whilst avoiding overfitting. Our main algorithm uses a hybrid inference approach combining variational Bayes and sequential Monte Carlo. We also present stochastic variational inference and online learning approaches for fast learning with long time series.}, url = {http://papers.nips.cc/paper/5375-variational-gaussian-process-state-space-models.pdf} } @incollection{FriLinSchRas13, cat = {gp np time}, title = {Bayesian Inference and Learning in {G}aussian Process State-Space Models with Particle {MCMC}}, author = {Roger Frigola and Fredrik Lindsten and Thomas B. Sch\"{o}n and Carl Edward Rasmussen}, booktitle = {Advances in Neural Information Processing Systems 26}, editor = {L. Bottou and C.J.C. Burges and Z. Ghahramani and M. Welling and K.Q. Weinberger}, year = 2013, publisher = curran, abstract = {State-space models are successfully used in many areas of science, engineering and economics to model time series and dynamical systems. We present a fully Bayesian approach to inference and learning in nonlinear nonparametric state-space models. We place a Gaussian process prior over the transition dynamics, resulting in a flexible model able to capture complex dynamical phenomena. However, to enable efficient inference, we marginalize over the dynamics of the model and instead infer directly the joint smoothing distribution through the use of specially tailored Particle Markov Chain Monte Carlo samplers. Once a sample from the smoothing distribution is computed, the state transition predictive distribution can be formulated analytically. We make use of sparse Gaussian process models to greatly reduce the computational complexity of the approach.}, url = {http://arxiv.org/pdf/1306.2861} } @inproceedings{FriLinSchRas14, cat = {gp np time}, title = {Identification of {G}aussian Process State-Space Models with Particle Stochastic Approximation {EM}}, author = {Roger Frigola and Fredrik Lindsten and Thomas B. Sch\"{o}n and Carl Edward Rasmussen}, booktitle = {Proceedings of the 19th World Congress of the International Federation of Automatic Control (IFAC)}, year = 2014, abstract = {Gaussian process state-space models (GP-SSMs) are a very flexible family of models of nonlinear dynamical systems. They comprise a Bayesian nonparametric representation of the dynamics of the system and additional (hyper-)parameters governing the properties of this nonparametric representation. The Bayesian formalism enables systematic reasoning about the uncertainty in the system dynamics. We present an approach to maximum likelihood identification of the parameters in GP-SSMs, while retaining the full nonparametric description of the dynamics. The method is based on a stochastic approximation version of the EM algorithm that employs recent developments in particle Markov chain Monte Carlo for efficient identification. }, url = {http://arxiv.org/pdf/1312.4852} } @inproceedings{FriRas13, cat = {gp np time}, author = {Roger Frigola and Carl Edward Rasmussen}, booktitle = {Decision and Control (CDC), 2013 IEEE 52nd Annual Conference on}, title = {Integrated Pre-Processing for {B}ayesian Nonlinear System Identification with {G}aussian Processes}, year = 2013, abstract = {We introduce GP-FNARX: a new model for nonlinear system identification based on a nonlinear autoregressive exogenous model (NARX) with filtered regressors (F) where the nonlinear regression problem is tackled using sparse Gaussian processes (GP). We integrate data pre-processing with system identification into a fully automated procedure that goes from raw data to an identified model. Both pre-processing parameters and GP hyper-parameters are tuned by maximizing the marginal likelihood of the probabilistic model. We obtain a Bayesian model of the system's dynamics which is able to report its uncertainty in regions where the data is scarce. The automated approach, the modeling of uncertainty and its relatively low computational cost make of GP-FNARX a good candidate for applications in robotics and adaptive control.}, url = {http://arxiv.org/pdf/1303.2912} } @inproceedings{Gal2014Pitfalls, title = {Pitfalls in the use of Parallel Inference for the {D}irichlet Process}, author = {Gal, Yarin and Ghahramani, Zoubin}, booktitle = {Proceedings of the 31th International Conference on Machine Learning (ICML-14)}, url = {http://jmlr.org/proceedings/papers/v32/gal14.pdf}, cat = {np}, abstract = {Recent work done by Lovell, Adams, and Mansingka (2012) and Williamson, Dubey, and Xing (2013) has suggested an alternative parametrisation for the Dirichlet process in order to derive non-approximate parallel MCMC inference for it – work which has been picked-up and implemented in several different fields. In this paper we show that the approach suggested is impractical due to an extremely unbalanced distribution of the data. We characterise the requirements of efficient parallel inference for the Dirichlet process and show that the proposed inference fails most of these requirements (while approximate approaches often satisfy most of them). We present both theoretical and experimental evidence, analysing the load balance for the inference and showing that it is independent of the size of the dataset and the number of nodes available in the parallel implementation. We end with suggestions of alternative paths of research for efficient non-approximate parallel inference for the Dirichlet process.}, year = 2014 } @inproceedings{GalBlu13, cat = {gm np}, title = "A Systematic {B}ayesian Treatment of the {IBM} Alignment Models", author = "Yarin Gal and Phil Blunsom", year = 2013, booktitle = "Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies", publisher = "Association for Computational Linguistics", url = {.}, abstract = {The dominant yet ageing IBM and HMM word alignment models underpin most popular Statistical Machine Translation implementations in use today. Though beset by the limitations of implausible independence assumptions, intractable optimisation problems, and an excess of tunable parameters, these models provide a scalable and reliable starting point for inducing translation systems. In this paper we build upon this venerable base by recasting these models in the non-parametric Bayesian framework. By replacing the categorical distributions at their core with hierarchical Pitman-Yor processes, and through the use of collapsed Gibbs sampling, we provide a more flexible formulation and sidestep the original heuristic optimisation techniques. The resulting models are highly extendible, naturally permitting the introduction of phrasal dependencies. We present extensive experimental results showing improvements in both AER and BLEU when benchmarked against Giza++, including significant improvements over IBM model 4.} } @inproceedings{GalCheGha15, Author = {Yarin Gal and Yutian Chen and Zoubin Ghahramani}, Title = {Latent {G}aussian Processes for Distribution Estimation of Multivariate Categorical Data}, booktitle = {Proceedings of the 32nd International Conference on Machine Learning (ICML-15)}, year = 2015, pages = {645--654}, abstract = {Multivariate categorical data occur in many applications of machine learning. One of the main difficulties with these vectors of categorical variables is sparsity. The number of possible observations grows exponentially with vector length, but dataset diversity might be poor in comparison. Recent models have gained significant improvement in supervised tasks with this data. These models embed observations in a continuous space to capture similarities between them. Building on these ideas we propose a Bayesian model for the unsupervised task of distribution estimation of multivariate categorical data. We model vectors of categorical variables as generated from a non-linear transformation of a continuous latent space. Non-linearity captures multi-modality in the distribution. The continuous representation addresses sparsity. Our model ties together many existing models, linking the linear categorical latent Gaussian model, the Gaussian process latent variable model, and Gaussian process classification. We derive inference for our model based on recent developments in sampling based variational inference. We show empirically that the model outperforms its linear and discrete counterparts in imputation tasks of sparse data.}, url = {http://jmlr.org/proceedings/papers/v37/gala15.html}, cat = {gp np approx bioinf} } @inproceedings{GalTur15, Author = {Yarin Gal and Richard Turner}, Title = {Improving the {G}aussian Process Sparse Spectrum Approximation by Representing Uncertainty in Frequency Inputs}, booktitle = {Proceedings of the 32nd International Conference on Machine Learning (ICML-15)}, year = 2015, pages = {655--664}, abstract = {Standard sparse pseudo-input approximations to the Gaussian process (GP) cannot handle complex functions well. Sparse spectrum alternatives attempt to answer this but are known to over-fit. We suggest the use of variational inference for the sparse spectrum approximation to avoid both issues. We model the covariance function with a finite Fourier series approximation and treat it as a random variable. The random covariance function has a posterior, on which a variational distribution is placed. The variational distribution transforms the random covariance function to fit the data. We study the properties of our approximate inference, compare it to alternative ones, and extend it to the distributed and stochastic domains. Our approximation captures complex functions better than standard approaches and avoids over-fitting.}, url = {http://jmlr.org/proceedings/papers/v37/galb15.html}, cat = {gp np approx} } @incollection{GalWilRas14, title = {Distributed Variational Inference in Sparse {G}aussian Process Regression and Latent Variable Models}, author = {Gal, Yarin and van der Wilk, Mark and Rasmussen, Carl}, booktitle = {Advances in Neural Information Processing Systems 27}, editor = {Z. Ghahramani and M. Welling and C. Cortes and N.D. Lawrence and K.Q. Weinberger}, pages = {3257--3265}, abstract = {Gaussian processes (GPs) are a powerful tool for probabilistic inference over functions. They have been applied to both regression and non-linear dimensionality reduction, and offer desirable properties such as uncertainty estimates, robustness to over-fitting, and principled ways for tuning hyper-parameters. However the scalability of these models to big datasets remains an active topic of research. We introduce a novel re-parametrisation of variational inference for sparse GP regression and latent variable models that allows for an efficient distributed algorithm. This is done by exploiting the decoupling of the data given the inducing points to re-formulate the evidence lower bound in a Map-Reduce setting. We show that the inference scales well with data and computational resources, while preserving a balanced distribution of the load among the nodes. We further demonstrate the utility in scaling Gaussian processes to big data. We show that GP performance improves with increasing amounts of data in regression (on flight data with 2 million records) and latent variable modelling (on MNIST). The results show that GPs perform better than many common models often used for big data.}, year = 2014, publisher = {Curran Associates, Inc.}, url = {http://papers.nips.cc/paper/5593-distributed-variational-inference-in-sparse-gaussian-process-regression-and-latent-variable-models}, cat = {gp np} } @article{Gha01a, author = {Ghahramani, Zoubin}, cat = {time, gm}, journal = {IJPRAI}, number = 1, pages = {9-42}, title = {An Introduction to Hidden {M}arkov Models and {B}ayesian Networks}, url = {.}, volume = 15, year = 2001, abstract = {We provide a tutorial on learning and inference in hidden Markov models in the context of the recent literature on Bayesian networks. This perspective make sit possible to consider novel generalizations to hidden Markov models with multiple hidden state variables, multiscale representations, and mixed discrete and continuous variables. Although exact inference in these generalizations is usually intractable, one can use approximate inference in these generalizations is usually intractable, one can use approximate inference algorithms such as Markov chain sampling and variational methods. We describe how such methods are applied to these generalized hidden Markov models. We conclude this review with a discussion of Bayesian methods for model selection in generalized HMMs. } } @inproceedings{Gha03a, author = {Ghahramani, Zoubin}, booktitle = {Advanced Lectures on Machine Learning}, cat = {review}, editor = {Bousquet, Olivier and von Luxburg, Ulrike and R{\"a}tsch, Gunnar}, isbn = {3-540-23122-6}, pages = {72-112}, publisher = {Springer}, series = {Lecture Notes in Computer Science}, title = {Unsupervised {L}earning}, url = {.}, volume = 3176, year = 2003, abstract = {We give a tutorial and overview of the field of unsupervised learning from the perspective of statistical modelling. Unsupervised learning can be motivated from information theoretic and Bayesian principles. We briefly review basic models in unsupervised learning, including factor analysis, PCA, mixtures of Gaussians, ICA, hidden Markov models, state-space models, and many variants and extensions. We derive the EM algorithm and give an overview of fundamental concepts in graphical models, and inference algorithms on graphs. This is followed by a quick tour of approximate Bayesian inference, including Markov chain Monte Carlo (MCMC), Laplace approximation, BIC, variational approximations, and expectation propagation (EP). The aim of this chapter is to provide a high-level view of the field. Along the way, many state-of-the-art ideas and future directions are also reviewed. } } @article{Gha12, cat = {np review}, author = {Zoubin Ghahramani}, title = {{B}ayesian nonparametrics and the probabilistic approach to modelling}, journal = {Philosophical Transactions of the Royal Society A}, url = {.}, year = 2013, abstract = {Modelling is fundamental to many fields of science and engineering. A model can be thought of as a representation of possible data one could predict from a system. The probabilistic approach to modelling uses probability theory to express all aspects of uncertainty in the model. The probabilistic approach is synonymous with Bayesian modelling, which simply uses the rules of probability theory in order to make predictions, compare alternative models, and learn model parameters and structure from data. This simple and elegant framework is most powerful when coupled with flexible probabilistic models. Flexibility is achieved through the use of Bayesian nonparametrics. This article provides an overview of probabilistic modelling and an accessible survey of some of the main tools in Bayesian nonparametrics. The survey covers the use of Bayesian nonparametrics for modelling unknown functions, density estimation, clustering, time series modelling, and representing sparsity, hierarchies, and covariance structure. More specifically it gives brief non-technical overviews of Gaussian processes, Dirichlet processes, infinite hidden Markov models, Indian buffet processes, Kingman's coalescent, Dirichlet diffusion tress, and Wishart processes.} } @article{Gha15, cat = {np review}, author = {Zoubin Ghahramani}, title = {Probabilistic machine learning and artificial intelligence}, journal = {Nature}, url = {http://www.nature.com/nature/journal/v521/n7553/full/nature14541.html}, year = 2015, volume = 521, pages = {452–459}, doi = {doi:10.1038/nature14541}, abstract = {How can a machine learn from experience? Probabilistic modelling provides a framework for understanding what learning is, and has therefore emerged as one of the principal theoretical and practical approaches for designing machines that learn from data acquired through experience. The probabilistic framework, which describes how to represent and manipulate uncertainty about models and predictions, has a central role in scientific data analysis, machine learning, robotics, cognitive science and artificial intelligence. This Review provides an introduction to this framework, and discusses some of the state-of-the-art advances in the field, namely, probabilistic programming, Bayesian optimization, data compression and automatic model discovery.} } @inproceedings{Gha94a, author = {Ghahramani, Zoubin}, booktitle = nips7, cat = {clust}, editor = {Tesauro, Gerald and Touretzky, David S. and Leen, Todd K.}, pages = {617-624}, publisher = {MIT Press}, title = {Factorial Learning and the {EM} Algorithm}, url = {.}, year = 1994, abstract = {Many real world learning problems are best characterized by an interaction of multiple independent causes or factors. Discovering such causal structure from the data is the focus of this paper. Based on Zemel and Hinton's cooperative vector quantizer (CVQ) architecture, an unsupervised learning algorithm is derived from the Expectation--Maximization (EM) framework. Due to the combinatorial nature of the data generation process, the exact E-step is computationally intractable. Two alternative methods for computing the E-step are proposed: Gibbs sampling and mean-field approximation, and some promising empirical results are presented. } } @inproceedings{Gha97a, author = {Ghahramani, Zoubin}, booktitle = {Adaptive Processing of Sequences and Data Structures}, cat = {time, gm}, editor = {Giles, C. Lee and Gori, Marco}, isbn = {3-540-64341-9}, pages = {168-197}, publisher = {Springer}, series = {Lecture Notes in Computer Science}, title = {Learning Dynamic {B}ayesian Networks}, url = {.}, volume = 1387, year = 1997, abstract = {Bayesian networks are a concise graphical formalism for describing probabilistic models. We have provided a brief tutorial of methods for learning and inference in dynamic Bayesian networks. In many of the interesting models, beyond the simple linear dynamical system or hidden Markov model, the calculations required for inference are intractable. Two different approaches for handling this intractability are Monte Carlo methods such as Gibbs sampling, and variational methods. An especially promising variational approach is based on exploiting tractable substructures in the Bayesian network. } } @inproceedings{GhaBea00a, author = {Ghahramani, Zoubin and Beal, Matthew J.}, booktitle = {NIPS}, cat = {approx}, editor = {Kearns, Michael J. and Solla, Sara A. and Cohn, David A.}, isbn = {0-262-11245-0}, pages = {507-513}, publisher = {The MIT Press}, title = {Propagation Algorithms for Variational {B}ayesian Learning}, url = {.}, year = 2000, abstract = {Variational approximations are becoming a widespread tool for Bayesian learning of graphical models. We provide some theoretical results for the variational updates in a very general family of conjugate-exponential graphical models. We show how the belief propagation and the junction tree algorithms can be used in the inference step of variational Bayesian learning. Applying these results to the Bayesian analysis of linear-Gaussian state-space models we obtain a learning procedure that exploits the Kalman smoothing propagation, while integrating over all model parameters. We demonstrate how this can be used to infer the hidden state dimensionality of the state-space model in a variety of synthetic problems and one real high-dimensional data set. } } @inproceedings{GhaBea99a, author = {Ghahramani, Zoubin and Beal, Matthew J.}, booktitle = {NIPS}, editor = {Kearns, Michael J. and Solla, Sara A. and Cohn, David A.}, isbn = {0-262-11245-0}, pages = {449-455}, publisher = {The MIT Press}, title = {Variational Inference for {B}ayesian Mixtures of Factor Analysers}, url = {.}, year = 1999, abstract = {We present an algorithm that infers the model structure of a mixture of factor analysers using an efficient and deterministic variational approximation to full Bayesian integration over model parameters. This procedure can automatically determine the optimal number of components and the local dimensionality of each component (i.e. the number of factors in each factor analyser). Alternatively it can be used to infer posterior distributions over number of components and dimensionalities. Since all parameters are integrated out the method is not prone to overfitting. Using a stochastic procedure for adding components it is possible to perform the variational optimisation incrementally and to avoid local maxima. Results show that the method works very well in practice and correctly infers the number and dimensionality of nontrivial synthetic examples. By importance sampling from the variational approximation we show how to obtain unbiased estimates of the true evidence, the exact predictive density, and the KL divergence between the variational posterior and the true posterior, not only in this model but for variational approximations in general.} } @inproceedings{GhaGriSol07, cat = {np}, month = {July}, author = {Z. Ghahramani and T.L. Griffiths and P. Sollich}, annote = {Includes discussion by David Dunson, and rejoinder.}, booktitle = {Bayesian Statistics 8}, editor = {J.M. Bernardo and M.J. Bayarri and J.O. Berger and A.P. Dawid and D. Heckerman and A.F.M. Smith and M. West}, address = {Oxford, UK}, title = {Bayesian nonparametric latent feature models (with discussion)}, publisher = oup, pages = {201--226}, year = 2007, url = {.}, abstract = {We describe a flexible nonparametric approach to latent variable modelling in which the number of latent variables is unbounded. This approach is based on a probability distribution over equivalence classes of binary matrices with a finite number of rows, corresponding to the data points, and an unbounded number of columns, corresponding to the latent variables. Each data point can be associated with a subset of the possible latent variables, which we refer to as the latent features of that data point. The binary variables in the matrix indicate which latent feature is possessed by which data point, and there is a potentially infinite array of features. We derive the distribution over unbounded binary matrices by taking the limit of a distribution over N×K binary matrices as K→∞. We define a simple generative processes for this distribution which we call the Indian buffet process (IBP; Griffiths and Ghahramani, 2005, 2006) by analogy to the Chinese restaurant process (Aldous, 1985; Pitman, 2002). The IBP has a single hyperparameter which controls both the number of feature per ob ject and the total number of features. We describe a two-parameter generalization of the IBP which has additional flexibility, independently controlling the number of features per object and the total number of features in the matrix. The use of this distribution as a prior in an infinite latent feature model is illustrated, and Markov chain Monte Carlo algorithms for inference are described.} } @inproceedings{GhaHel06, cat = {ir clust}, author = {Zoubin Ghahramani and Katherine A. Heller}, title = {{B}ayesian Sets}, booktitle = nips18, year = 2006, month = {December}, address = {Cambridge, MA, USA}, publisher = mit, editor = {Y. Weiss and B. Sch\"{o}lkopf and J. Platt}, pages = {435--442}, url = {.}, abstract = {Inspired by "Google™ Sets", we consider the problem of retrieving items from a concept or cluster, given a query consisting of a few items from that cluster. We formulate this as a Bayesian inference problem and describe a very simple algorithm for solving it. Our algorithm uses a model-based concept of a cluster and ranks items using a score which evaluates the marginal probability that each item belongs to a cluster containing the query items. For exponential family models with conjugate priors this marginal probability is a simple function of sufficient statistics. We focus on sparse binary data and show that our score can be evaluated exactly using a single sparse matrix multiplication, making it possible to apply our algorithm to very large datasets. We evaluate our algorithm on three datasets: retrieving movies from EachMovie, finding completions of author sets from the NIPS dataset, and finding completions of sets of words appearing in the Grolier encyclopedia. We compare to Google™ Sets and show that Bayesian Sets gives very reasonable set completions.} } @article{GhaHin00a, author = {Ghahramani, Zoubin and Hinton, Geoffrey E.}, cat = {approx, time}, journal = {Neural Computation}, number = 4, pages = {831-864}, title = {Variational Learning for Switching State-Space Models}, url = {.}, volume = 12, year = 2000, abstract = {We introduce a new statistical model for time series which iteratively segments data into regimes with approximately linear dynamics and learns the parameters of each of these linear regimes. This model combines and generalizes two of the most widely used stochastic time series models---hidden Markov models and linear dynamical systems---and is closely related to models that are widely used in the control and econometrics literatures. It can also be derived by extending the mixture of experts neural network (Jacobs et al, 1991) to its fully dynamical version, in which both expert and gating networks are recurrent. Inferring the posterior probabilities of the hidden states of this model is computationally intractable, and therefore the exact Expectation Maximization (EM) algorithm cannot be applied. However, we present a variational approximation that maximizes a lower bound on the log likelihood and makes use of both the forward--backward recursions for hidden Markov models and the Kalman filter recursions for linear dynamical systems. We tested the algorithm both on artificial data sets and on a natural data set of respiration force from a patient with sleep apnea. The results suggest that variational approximations are a viable method for inference and learning in switching state-space models.} } @inproceedings{GhaHin97a, author = {Ghahramani, Zoubin and Hinton, Geoffrey E.}, booktitle = {NIPS}, cat = {deep}, editor = {Jordan, Michael I. and Kearns, Michael J. and Solla, Sara A.}, isbn = {0-262-10076-2}, publisher = {The MIT Press}, title = {Hierarchical Non-linear Factor Analysis and Topographic Maps}, url = {.}, year = 1997, abstract = {We first describe a hierarchical, generative model that can be viewed as a non-linear generalisation of factor analysis and can be implemented in a neural network. The model performs perceptual inference in a probabilistically consistent manner by using top-down, bottom-up and lateral connections. These connections can be learned using simple rules that require only locally available information. We then show how to incorporate lateral connections into the generative model. The model extracts a sparse, distributed, hierarchical representation of depth from simplified random-dot stereograms and the localised disparity detectors in the first hidden layer form a topographic map. When presented with image patches from natural scenes, the model develops topographically organised local feature detectors.} } @inproceedings{GhaJor93a, author = {Ghahramani, Zoubin and Jordan, Michael I.}, booktitle = {NIPS}, cat = {clust}, editor = {Cowan, Jack D. and Tesauro, Gerald and Alspector, Joshua}, isbn = {1-55860-322-0}, pages = {120-127}, publisher = {Morgan Kaufmann}, title = {Supervised learning from incomplete data via an {EM} approach}, url = {.}, year = 1993, abstract = {Real-world learning tasks may involve high-dimensional data sets with arbitrary patterns of missing data. In this paper we present a framework based on maximum likelihood density estimation for learning from such data sets. We use mixture models for the density estimates and make two distinct appeals to the ExpectationMaximization (EM) principle (Dempster et al., 1977) in deriving a learning algorithm---EM is used both for the estimation of mixture components and for coping with missing data. The resulting algorithm is applicable to a wide range of supervised as well as unsupervised learning problems. Results from a classification benchmark---the iris data set---are presented.} } @inproceedings{GhaJor95a, author = {Ghahramani, Zoubin and Jordan, Michael I.}, booktitle = {NIPS}, cat = {time}, editor = {Touretzky, David S. and Mozer, Michael and Hasselmo, Michael E.}, isbn = {0-262-20107-0}, pages = {472-478}, publisher = {MIT Press}, title = {Factorial Hidden {M}arkov Models}, url = {.}, year = 1995, abstract = {Hidden Markov models (HMMs) have proven to be one of the most widely used tools for learning probabilistic models of time series data. In an HMM, information about the past is conveyed through a single discrete variable---the hidden state. We discuss a generalization of HMMs in which this state is factored into multiple state variables and is therefore represented in a distributed manner. We describe an exact algorithm for inferring the posterior probabilities of the hidden state variables given the observations, and relate it to the forward--backward algorithm for HMMs and to algorithms for more general graphical models. Due to the combinatorial nature of the hidden state representation, this exact algorithm is intractable. As in other intractable systems, approximate inference can be carried out using Gibbs sampling or variational methods. Within the variational framework, we present a structured approximation in which the the state variables are decoupled, yielding a tractable algorithm for learning the parameters of the model. Empirical comparisons suggest that these approximations are efficient and provide accurate alternatives to the exact methods. Finally, we use the structured approximation to model Bach's chorales and show that factorial HMMs can capture statistical structure in this data set which an unconstrained HMM cannot.} } @article{GhaJor97a, author = {Ghahramani, Zoubin and Jordan, Michael I.}, cat = {time}, journal = {Machine Learning}, number = {2-3}, pages = {245-273}, title = {Factorial Hidden {M}arkov Models}, url = {.}, volume = 29, year = 1997, abstract = {Hidden Markov models (HMMs) have proven to be one of the most widely used tools for learning probabilistic models of time series data. In an HMM, information about the past is conveyed through a single discrete variable---the hidden state. We discuss a generalization of HMMs in which this state is factored into multiple state variables and is therefore represented in a distributed manner. We describe an exact algorithm for inferring the posterior probabilities of the hidden state variables given the observations, and relate it to the forward--backward algorithm for HMMs and to algorithms for more general graphical models. Due to the combinatorial nature of the hidden state representation, this exact algorithm is intractable. As in other intractable systems, approximate inference can be carried out using Gibbs sampling or variational methods. Within the variational framework, we present a structured approximation in which the the state variables are decoupled, yielding a tractable algorithm for learning the parameters of the model. Empirical comparisons suggest that these approximations are efficient and provide accurate alternatives to the exact methods. Finally, we use the structured approximation to model Bach's chorales and show that factorial HMMs can capture statistical structure in this data set which an unconstrained HMM cannot.} } @inproceedings{GhaKorHin99a, author = {Ghahramani, Zoubin and Korenberg, Alexander T and Hinton, Geoffrey E}, booktitle = {Artificial Neural Networks, 1999. ICANN 99. Ninth International Conference on (Conf. Publ. No. 470)}, cat = {deep}, organization = {IET}, pages = {13--18}, title = {Scaling in a hierarchical unsupervised network}, volume = 1, year = 1999, url = {.}, abstract = {A persistent worry with computational models of unsupervised learning is that learning will become more difficult as the problem is scaled. We examine this issue in the context of a novel hierarchical, generative model that can be viewed as a non-linear generalization of factor analysis and can be implemented in a neural network. The model performs perceptual inference in a probabilistically consistent manner by using top-down, bottom-up and lateral connections. These connections can be learned using simple rules that require only locally available information. We first demonstrate that the model can extract a sparse, distributed, hierarchical representation of global disparity from simplified random-dot stereograms. We then investigate some of the scaling properties of the algorithm on this problem and find that : (1) increasing the image size leads to faster and more reliable learning; (2) Increasing the depth of the network from one to two hidden layers leads to better representations at the first hidden layer, and (3) Once one part of the network has discovered how to represent disparity, it 'supervises' other parts of the network, greatly speeding up their learning.} } @inproceedings{GhaRow98a, author = {Ghahramani, Zoubin and Roweis, Sam T.}, booktitle = {NIPS}, cat = {time, rl}, editor = {Kearns, Michael J. and Solla, Sara A. and Cohn, David A.}, isbn = {0-262-11245-0}, pages = {431-437}, publisher = {The MIT Press}, title = {Learning Nonlinear Dynamical Systems Using an {EM} Algorithm}, url = {.}, year = 1998, abstract = {The Expectation Maximization (EM) algorithm is an iterative procedure for maximum likelihood parameter estimation from data sets with missing or hidden variables. It has been applied to system identification in linear stochastic state-space models, where the state variables are hidden from the observer and both the state and the parameters of the model have to be estimated simultaneously [9]. We present a generalization of the EM algorithm for parameter estimation in nonlinear dynamical systems. The ``expectation'' step makes use of Extended Kalman Smoothing to estimate the state, while the ``maximization'' step re-estimates the parameters using these uncertain state estimates. In general, the nonlinear maximization step is difficult because it requires integrating out the uncertainty in the states. However, if Gaussian radial basis function (RBF) approximators are used to model the nonlinearities, the integrals become tractable and the maximization step can be solved via systems of linear equations. } } @inproceedings{GhaWolJor94a, author = {Ghahramani, Zoubin and Wolpert, Daniel M. and Jordan, Michael I.}, booktitle = nips7, cat = {neuro}, editor = {Tesauro, Gerald and Touretzky, David S. and Leen, Todd K.}, pages = {1125-1132}, publisher = {MIT Press}, title = {Computational Structure of coordinate transformations: A generalization study}, url = {.}, year = 1994, abstract = {One of the fundamental properties that both neural networks and the central nervous system share is the ability to learn and generalize from examples. While this property has been studied extensively in the neural network literature it has not been thoroughly explored in human perceptual and motor learning. We have chosen a coordinate transformation system---the visuomotor map which transforms visual coordinates into motor coordinates---to study the generalization effects of learning new input--output pairs. Using a paradigm of computer controlled altered visual feedback, we have studied the generalization of the visuomotor map subsequent to both local and context-dependent remappings. A local remapping of one or two input-output pairs induced a significant global, yet decaying, change in the visuomotor map, suggesting a representation for the map composed of units with large functional receptive fields. Our study of context-dependent remappings indicated that a single point in visual space can be mapped to two different finger locations depending on a context variable---the starting point of the movement. Furthermore, as the context is varied there is a gradual shift between the two remappings, consistent with two visuomotor modules being learned and gated smoothly with the context.} } @article{GibSaa08, author = {Richard J. Gibbens and Yunus Saat\c{c}i}, title = {Data, modelling and inference in road traffic networks}, journal = {Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences}, year = 2008, volume = 366, pages = {1907--1919}, number = 1872, month = {June}, abstract = {In this paper, we study UK road traffic data and explore a range of modelling and inference questions that arise from them. For example, loop detectors on the M25 motorway record speed and flow measurements at regularly spaced locations as well as the entry and exit lanes of junctions. An exploratory study of these data helps us to better understand and quantify the nature of congestion on the road network. From a traveller's perspective it is crucially important to understand the overall journey times and we look at methods to improve our ability to predict journey times given access jointly to both real-time and historical loop detector data. Throughout this paper we will comment on related work derived from US freeway data.}, doi = {10.1098/rsta.2008.0020}, eprint = {http://rsta.royalsocietypublishing.org/content/366/1872/1907.full.pdf+html}, url = {http://rsta.royalsocietypublishing.org/content/366/1872/1907.abstract} } @inproceedings{GilSaaCun13, cat = {approx gp}, author = {E.~Gilboa and Yunus Saat\c{c}i and John P.~Cunningham}, title = {Scaling Multidimensional {G}aussian Processes using Projected Additive Approximations}, year = 2013, booktitle = icml30, abstract = {Exact Gaussian Process (GP) regression has O(N

While one strategy to overcome this problem is to design novel efficient algorithms, the other is to 'reduce' the size of the large graph by sampling. This is the scope of this paper: We will present novel Metropolis algorithms for sampling a 'representative' small subgraph from the original large graph, with 'representative' describing the requirement that the sample shall preserve crucial graph properties of the original graph. In our experiments, we improve over the pioneering work of Leskovec and Faloutsos (KDD 2006), by producing representative subgraph samples that are both smaller and of higher quality than those produced by other methods from the literature.} } @inproceedings{HusDuv2012, cat = {np gp mcmc active approx}, author = {Ferenc Husz\'{a}r and David Duvenaud}, title = {Optimally-Weighted Herding is {B}ayesian Quadrature}, booktitle = uai28, address = {Catalina Island, California}, abstract = {Herding and kernel herding are deterministic methods of choosing samples which summarise a probability distribution. A related task is choosing samples for estimating integrals using Bayesian quadrature. We show that the criterion minimised when selecting samples in kernel herding is equivalent to the posterior variance in Bayesian quadrature. We then show that sequential Bayesian quadrature can be viewed as a weighted version of kernel herding which achieves performance superior to any other weighted herding method. We demonstrate empirically a rate of convergence faster than O(1/N). Our results also imply an upper bound on the empirical error of the Bayesian quadrature estimate.}, month = {July}, year = 2012, pages = {377--385}, url = {http://arxiv.org/pdf/1204.1664.pdf} } @article{HusHou11, cat = {active}, title = {Adaptive {Bayesian} Quantum Tomography}, author = {Ferenc Husz\'{a}r and Neil Houlsby}, journal = {Physical Review A}, volume = 85, number = 5, pages = 052120, year = 2012, publisher = {APS}, url = {http://pra.aps.org/abstract/PRA/v85/i5/e052120}, abstract = {In this paper we revisit the problem of optimal design of quantum tomographic experiments. In contrast to previous approaches where an optimal set of measurements is decided in advance of the experiment, we allow for measurements to be adaptively and efficiently re-optimised depending on data collected so far. We develop an adaptive statistical framework based on Bayesian inference and Shannon's information, and demonstrate a ten-fold reduction in the total number of measurements required as compared to non-adaptive methods, including mutually unbiased bases.} } @techreport{HusLac11, title = {A Kernel Approach to Tractable {Bayesian} Nonparametrics}, author = {Ferenc Husz\'{a}r and Simon Lacoste-Julien}, year = 2011, institution = {University of Cambridge}, annote = {arXiv:1103.1761}, url = {.}, abstract = { Inference in popular nonparametric Bayesian models typically relies on sampling or other approximations. This paper presents a general methodology for constructing novel tractable nonparametric Bayesian methods by applying the kernel trick to inference in a parametric Bayesian model. For example, Gaussian process regression can be derived this way from Bayesian linear regression. Despite the success of the Gaussian process framework, the kernel trick is rarely explicitly considered in the Bayesian literature. In this paper, we aim to fill this gap and demonstrate the potential of applying the kernel trick to tractable Bayesian parametric models in a wider context than just regression. As an example, we present an intuitive Bayesian kernel machine for density estimation that is obtained by applying the kernel trick to a Gaussian generative model in feature space.} } @inproceedings{HusNopLen10, author = {Ferenc Husz\'{a}r and Uta Noppeney and M\'{a}t\'{e} Lengyel}, title = {Mind reading by machine learning: {A} doubly {Bayesian} method for inferring mental representations}, booktitle = cogsci32, year = 2010, month = {August}, editor = {S. Ohlsson and R. Catrambone}, publisher = {The Cognitive Science Society}, address = {Austin, TX, USA}, abstract = {A central challenge in cognitive science is to measure and quantify the mental representations humans develop --- in other words, to 'read' subject's minds. In order to eliminate potential biases in reporting mental contents due to verbal elaboration, subjects' responses in experiments are often limited to binary decisions or discrete choices that do not require conscious reflection upon their mental contents. However, it is unclear what such impoverished data can tell us about the potential richness and dynamics of subjects' mental representations. To address this problem, we used ideal observer models that formalise choice behaviour as (quasi-)Bayes-optimal, given subjects' representations in long-term memory, acquired through prior learning, and the stimuli currently available to them. Bayesian inversion of such ideal observer models allowed us to infer subjects' mental representation from their choice behaviour in a variety of psychophysical tasks. The inferred mental representations also allowed us to predict future choices of subjects with reasonable accuracy, even in tasks that were different from those in which the representations were estimated. These results demonstrate a significant potential in standard binary decision tasks to recover detailed information about subjects' mental representations}, url = {.}, annote = {Supplementary material available here.} } @inproceedings{IwaDuvGha12, cat = {np gp clust}, author = {Tomoharu Iwata and David Duvenaud and Zoubin Ghahramani}, title = {Warped Mixtures for Nonparametric Cluster Shapes}, booktitle = uai29, address = {Bellevue, Washington}, abstract = {A mixture of Gaussians fit to a single curved or heavy-tailed cluster will report that the data contains many clusters. To produce more appropriate clusterings, we introduce a model which warps a latent mixture of Gaussians to produce nonparametric cluster shapes. The possibly low-dimensional latent mixture model allows us to summarize the properties of the high-dimensional clusters (or density manifolds) describing the data. The number of manifolds, as well as the shape and dimension of each manifold is automatically inferred. We derive a simple inference scheme for this model which analytically integrates out both the mixture parameters and the warping function. We show that our model is effective for density estimation, performs better than infinite Gaussian mixture models at recovering the true number of clusters, and produces interpretable summaries of high-dimensional datasets.}, month = {July}, year = 2013, url = {http://arxiv.org/pdf/1206.1846} } @inproceedings{IwaHouGha13, cat = {active}, author = {Tomoharu Iwata and Neil Houlsby and Zoubin Ghahramani}, year = 2013, title = {Active Learning for Interactive Visualization}, booktitle = aistats16, abstract = {Many automatic visualization methods have been proposed. However, a visualization that is automatically generated might be different to how a user wants to arrange the objects in visualization space. By allowing users to re-locate objects in the embedding space of the visualization, they can adjust the visualization to their preference. We propose an active learning framework for interactive visualization which selects objects for the user to re-locate so that they can obtain their desired visualization by re-locating as few as possible. The framework is based on an information theoretic criterion, which favors objects that reduce the uncertainty of the visualization. We present a concrete application of the proposed framework to the Laplacian eigenmap visualization method. We demonstrate experimentally that the proposed framework yields the desired visualization with fewer user interactions than existing methods.}, url = {http://jmlr.org/proceedings/papers/v31/iwata13a.html} } @article{IwaLloGha15, cat = {clust network ssl np}, title = {Unsupervised Many-to-Many Object Matching for Relational Data}, author = {Tomoharu Iwata and James Robert Lloyd and Zoubin Ghahramani}, abstract = {We propose a method for unsupervised many-to-many object matching from multiple networks, which is the task of finding correspondences between groups of nodes in different networks. For example, the proposed method can discover shared word groups from multi-lingual document-word networks without cross-language alignment information. We assume that multiple networks share groups, and each group has its own interaction pattern with other groups. Using infinite relational models with this assumption, objects in different networks are clustered into common groups depending on their interaction patterns, discovering a matching. The effectiveness of the proposed method is experimentally demonstrated by using synthetic and real relational data sets, which include applications to cross-domain recommendation without shared user/item identifiers and multi-lingual word clustering.}, url = {http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=7208879&tag=1}, journal = PAMI, year = 2015 } @inproceedings{IwaShaGha13a, author = {Iwata, Tomoharu and Shah, Amar and Ghahramani, Zoubin}, booktitle = {KDD}, cat = {network}, isbn = {978-1-4503-2174-7}, pages = {266-274}, publisher = acm, title = {Discovering latent influence in online social activities via shared cascade poisson processes}, url = {.}, year = 2013, abstract = {Many people share their activities with others through online communities. These shared activities have an impact on other users' activities. For example, users are likely to become interested in items that are adopted (e.g. liked, bought and shared) by their friends. In this paper, we propose a probabilistic model for discovering latent influence from sequences of item adoption events. An inhomogeneous Poisson process is used for modeling a sequence, in which adoption by a user triggers the subsequent adoption of the same item by other users. For modeling adoption of multiple items, we employ multiple inhomogeneous Poisson processes, which share parameters, such as influence for each user and relations between users. The proposed model can be used for finding influential users, discovering relations between users and predicting item popularity in the future. We present an efficient Bayesian inference procedure of the proposed model based on the stochastic EM algorithm. The effectiveness of the proposed model is demonstrated by using real data sets in a social bookmark sharing service.} } @inproceedings{JinGha02a, author = {Jin, Rong and Ghahramani, Zoubin}, booktitle = {NIPS}, editor = {Becker, Suzanna and Thrun, Sebastian and Obermayer, Klaus}, isbn = {0-262-02550-7}, pages = {897-904}, publisher = {MIT Press}, title = {Learning with Multiple Labels}, url = {.}, year = 2002, abstract = {In this paper, we study a special kind of learning problem in which each training instance is given a set of (or distribution over) candidate class labels and only one of the candidate labels is the correct one. Such a problem can occur, e.g., in an information retrieval setting where a set of words is associated with an image, or if classes labels are organized hierarchically. We propose a novel discriminative approach for handling the ambiguity of class labels in the training examples. The experiments with the proposed approach over five different UCI datasets show that our approach is able to find the correct label among the set of candidate labels and actually achieve performance close to the case when each training instance is given a single correct label. In contrast, na{\"\i}ve methods degrade rapidly as more ambiguity is introduced into the labels. } } @article{JorGhaJaa99a, author = {Jordan, Michael I. and Ghahramani, Zoubin and Jaakkola, Tommi and Saul, Lawrence K.}, cat = {approx, review}, journal = {Machine Learning}, number = 2, pages = {183-233}, title = {An Introduction to Variational Methods for Graphical Models}, url = {.}, volume = 37, year = 1999, abstract = {This paper presents a tutorial introduction to the use of variational methods for inference and learning in graphical models (Bayesian networks and Markov random fields). We present a number of examples of graphical models, including the QMR-DT database, the sigmoid belief network, the Boltzmann machine, and several variants of hidden Markov models, in which it is infeasible to run exact inference algorithms. We then introduce variational methods, which exploit laws of large numbers to transform the original graphical model into a simplified graphical model in which inference is efficient. Inference in the simpified model provides bounds on probabilities of interest in the original model. We describe a general framework for generating variational transformations based on convex duality. Finally we return to the examples and demonstrate how variational algorithms can be formulated in each case.} } @inproceedings{JorGhaSau96a, author = {Jordan, Michael I. and Ghahramani, Zoubin and Saul, Lawrence K.}, booktitle = {NIPS}, cat = {time}, editor = {Mozer, Michael and Jordan, Michael I. and Petsche, Thomas}, pages = {501-507}, publisher = {MIT Press}, title = {Hidden {M}arkov Decision Trees}, url = {.}, year = 1996, abstract = {We study a time series model that can be viewed as a decision tree with Markov temporal structure. The model is intractable for exact calculations, thus we utilize variational approximations. We consider three different distributions for the approximation: one in which the Markov calculations are performed exactly and the layers of the decision tree are decoupled, one in which the decision tree calculations are performed exactly and the time steps of the Markov chain are decoupled, and one in which a Viterbi-like assumption is made to pick out a single most likely state sequence. We present simulation results for artificial data and the Bach chorales.} } @inproceedings{KasSobHer12, author = {Michael Kaschesky and Pawel Sobkowicz and Jos\'e Miguel Hern\'andez-Lobato and Guillaume Bouchard and Cedric Archambeau and Nicolas Scharioth and Robert Manchin and Adrian Gschwend and Reinhard Riedl}, title = {Bringing Representativeness into Social Media Monitoring and Analysis}, booktitle = {46th {Hawaii} International Conference on System Sciences}, year = 2013, address = {Manoa, Hawaii}, url = {http://www.computer.org/csdl/proceedings/hicss/2013/4892/00/4892c003.pdf}, abstract = {The opinions, expectations and behavior of citizens are increasingly reflected online --- therefore, mining the internet for such data can enhance decision-making in public policy, communications, marketing, finance and other fields. However, to come closer to the representativeness of classic opinion surveys there is a lack of knowledge about the sociodemographic characteristics of those voicing opinions on the internet. This paper proposes to calibrate online opinions aggregated from multiple and heterogeneous data sources with traditional surveys enhanced with rich socio-demographic information to enable insights into which opinions are expressed on the internet by specific segments of society. The goal of this research is to provide professionals in citizen- and consumer- centered domains with more concise near real-time intelligence on online opinions. To become effective, the methodologies presented in this paper must be integrated into a coherent decision support system.} } @inproceedings{KasVanGraHer10, author = {Kasneci, G. and {Van Gael}, J. and Graepel, T. and Herbrich, R.}, booktitle = {European Conference on Machine Learning (ECML)}, title = {Bayesian Knowledge Corroboration with Logical Rules and User Feedback}, year = 2010, month = {September}, address = {Barcelona, Spain}, url = {.}, abstract = {Current knowledge bases suffer from either low coverage or low accuracy. The underlying hypothesis of this work is that user feedback can greatly improve the quality of automatically extracted knowledge bases. The feedback could help quantify the uncertainty associated with the stored statements and would enable mechanisms for searching, ranking and reasoning at entity-relationship level. Most importantly, a principled model for exploiting user feedback to learn the truth values of statements in the knowledge base would be a major step forward in addressing the issue of knowledge base curation. We present a family of probabilistic graphical models that builds on user feedback and logical inference rules derived from the popular Semantic-Web formalism of RDFS [1]. Through internal inference and belief propagation, these models can learn both, the truth values of the statements in the knowledge base and the reliabilities of the users who give feedback. We demonstrate the viability of our approach in extensive experiments on real-world datasets, with feedback collected from Amazon Mechanical Turk.} } @article{KerFreLinKro14, cat = {binf}, author = {Peter Kerpedjiev and Jes Frellsen and Stinus Lindgreen and Anders Krogh}, title = {Adaptable probabilistic mapping of short reads using position specific scoring matrices}, journal = {BMC bioinformatics}, volume = 15, pages = 100, year = 2014, abstract = {BACKGROUND: Modern DNA sequencing methods produce vast amounts of data that often requires mapping to a reference genome. Most existing programs use the number of mismatches between the read and the genome as a measure of quality. This approach is without a statistical foundation and can for some data types result in many wrongly mapped reads. Here we present a probabilistic mapping method based on position-specific scoring matrices, which can take into account not only the quality scores of the reads but also user-specified models of evolution and data-specific biases.RESULTS:We show how evolution, data-specific biases, and sequencing errors are naturally dealt with probabilistically. Our method achieves better results than Bowtie and BWA on simulated and real ancient and PAR-CLIP reads, as well as on simulated reads from the AT rich organism P. falciparum, when modeling the biases of these data. For simulated Illumina reads, the method has consistently higher sensitivity for both single-end and paired-end data. We also show that our probabilistic approach can limit the problem of random matches from short reads of contamination and that it improves the mapping of real reads from one organism (D.~melanogaster) to a related genome (D.~simulans). CONCLUSION: The presented work is an implementation of a novel approach to short read mapping where quality scores, prior mismatch probabilities and mapping qualities are handled in a statistically sound manner. The resulting implementation provides not only a tool for biologists working with low quality and/or biased sequencing data but also a demonstration of the feasibility of using a probability based alignment method on real and simulated data sets.}, doi = {10.1186/1471-2105-15-100}, annote = {Peter Kerpedjiev and Jes Frellsen contributed equally. Additional resources are available at bwa-pssm.binf.ku.dk} } @article{KimGha06a, author = {Kim, Hyun-Chul and Ghahramani, Zoubin}, cat = {gp}, journal = {IEEE Trans. Pattern Anal. Mach. Intell.}, number = 12, pages = {1948-1959}, title = {Bayesian {G}aussian Process Classification with the {EM}-{EP} Algorithm}, url = {.}, volume = 28, year = 2006, abstract = {Gaussian process classifiers (GPCs) are Bayesian probabilistic kernel classifiers. In GPCs, the probability of belonging to a certain class at an input location is monotonically related to the value of some latent function at that location. Starting from a Gaussian process prior over this latent function, data are used to infer both the posterior over the latent function and the values of hyperparameters to determine various aspects of the function. Recently, the expectation propagation (EP) approach has been proposed to infer the posterior over the latent function. Based on this work, we present an approximate EM algorithm, the EM-EP algorithm, to learn both the latent function and the hyperparameters. This algorithm is found to converge in practice and provides an efficient Bayesian framework for learning hyperparameters of the kernel. A multiclass extension of the EM-EP algorithm for GPCs is also derived. In the experimental results, the EM-EP algorithms are as good or better than other methods for GPCs or Support Vector Machines (SVMs) with cross-validation. } } @inproceedings{KimGha08, cat = {gp}, volume = 5342, month = {December}, author = {H.~Kim and Zoubin Ghahramani}, series = lncs, booktitle = {Structural, Syntactic and Statistical Pattern Recognition}, editor = {L. Niels da Vitoria}, title = {Outlier robust {Gaussian} process classification}, address = {Berlin, Germany}, publisher = {Springer Berlin / Heidelberg}, year = 2008, journal = lncs, pages = {896--905}, url = {.}, abstract = {Gaussian process classifiers (GPCs) are a fully statistical model for kernel classification. We present a form of GPC which is robust to labeling errors in the data set. This model allows label noise not only near the class boundaries, but also far from the class boundaries which can result from mistakes in labelling or gross errors in measuring the input features. We derive an outlier robust algorithm for training this model which alternates iterations based on the EP approximation and hyperparameter updates until convergence. We show the usefulness of the proposed algorithm with model selection method through simulation results.} } @inproceedings{KimGha08a, author = {Kim, Hyun-Chul and Ghahramani, Zoubin}, booktitle = {SSPR/SPR}, cat = {gp}, editor = {da Vitoria Lobo, Niels and Kasparis, Takis and Roli, Fabio and Kwok, James Tin-Yau and Georgiopoulos, Michael and Anagnostopoulos, Georgios C. and Loog, Marco}, isbn = {978-3-540-89688-3}, pages = {896-905}, publisher = {Springer}, series = {Lecture Notes in Computer Science}, title = {Outlier Robust {G}aussian Process Classification}, url = {.}, volume = 5342, year = 2008, abstract = {Gaussian process classifiers (GPCs) are a fully statistical model for kernel classification. We present a form of GPC which is robust to labeling errors in the data set. This model allows label noise not only near the class boundaries, but also far from the class boundaries which can result from mistakes in labelling or gross errors in measuring the input features. We derive an outlier robust algorithm for training this model which alternates iterations based on the EP approximation and hyperparameter updates until convergence. We show the usefulness of the proposed algorithm with model selection method through simulation results. } } @inproceedings{KimGha12, author = {Hyun-Chul Kim and Zoubin Ghahramani}, year = 2012, title = {{B}ayesian Classifier Combination}, booktitle = aistats15, abstract = {Bayesian model averaging linearly mixes the probabilistic predictions of multiple models, each weighted by its posterior probability. This is the coherent Bayesian way of combining multiple models only under certain restrictive assumptions, which we outline. We explore a general framework for Bayesian model combination (which differs from model averaging) in the context of classification. This framework explicitly models the relationship between each model’s output and the unknown true label. The framework does not require that the models be probabilistic (they can even be human assessors), that they share prior information or receive the same training data, or that they be independent in their errors. Finally, the Bayesian combiner does not need to believe any of the models is in fact correct. We test several variants of this classifier combination procedure starting from a classic statistical model proposed by Dawid and Skene (1979) and using MCMC to add more complex but important features to the model. Comparisons on sev- eral data sets to simpler methods like majority voting show that the Bayesian methods not only perform well but result in interpretable diagnostics on the data points and the models.}, url = {.} } @article{KimKimGha06a, author = {Kim, Hyun-Chul and Kim, Daijin and Ghahramani, Zoubin and Bang, Sung Yang}, cat = {gp}, journal = {Pattern Recognition Letters}, number = 6, pages = {618-626}, title = {Appearance-based gender classification with {G}aussian processes}, url = {.}, volume = 27, year = 2006, abstract = {This paper concerns the gender classification task of discriminating between images of faces of men and women from face images. In appearance-based approaches, the initial images are preprocessed (e.g. normalized) and input into classifiers. Recently, support vector machines (SVMs) which are popular kernel classifiers have been applied to gender classification and have shown excellent performance. SVMs have difficulty in determining the hyperparameters in kernels (using cross-validation). We propose to use Gaussian process classifiers (GPCs) which are Bayesian kernel classifiers. The main advantage of GPCs over SVMs is that they determine the hyperparameters of the kernel based on Bayesian model selection criterion. The experimental results show that our methods outperformed SVMs with cross-validation in most of data sets. Moreover, the kernel hyperparameters found by GPCs using Bayesian methods can be used to improve SVM performance.} } @inproceedings{KimKimGha06b, author = {Kim, Hyun-Chul and Kim, Daijin and Ghahramani, Zoubin and Bang, Sung Yang}, booktitle = {IJCNN}, editor = {Cohen, William W. and Moore, Andrew}, isbn = {1-59593-383-2}, pages = {3371-3376}, publisher = acm, series = {ACM International Conference Proceeding Series}, title = {Gender Classification with {B}ayesian Kernel Methods}, url = {.}, volume = 148, year = 2006, abstract = {We consider the gender classification task of discriminating between images of faces of men and women from face images. In appearance-based approaches, the initial images are preprocessed (e.g. normalized) and input into classifiers. Recently, SVMs which are popular kernel classifiers have been applied to gender classification and have shown excellent performance. We propose to use one of Bayesian kernel methods which is Gaussian process classifiers (GPCs) for gender classification. The main advantage of Bayesian kernel methods such as GPCs over SVMs is that they determine the hyperparameters of the kernel based on Bayesian model selection criterion. Our results show that GPCs outperformed SVMs with cross validation. } } @article{KirGriSav12a, author = {Kirk, Paul D. W. and Griffin, Jim E. and Savage, Richard S. and Ghahramani, Zoubin and Wild, David L.}, cat = {np, bioinf}, journal = {Bioinformatics}, number = 24, pages = {3290-3297}, title = {Bayesian Correlated clustering to integrate multiple datasets}, url = {.}, volume = 28, year = 2012, abstract = {MOTIVATION: The integration of multiple datasets remains a key challenge in systems biology and genomic medicine. Modern high-throughput technologies generate a broad array of different data types, providing distinct-but often complementary-information. We present a Bayesian method for the unsupervised integrative modelling of multiple datasets, which we refer to as MDI (Multiple Dataset Integration). MDI can integrate information from a wide range of different datasets and data types simultaneously (including the ability to model time series data explicitly using Gaussian processes). Each dataset is modelled using a Dirichlet-multinomial allocation (DMA) mixture model, with dependencies between these models captured through parameters that describe the agreement among the datasets. RESULTS: Using a set of six artificially constructed time series datasets, we show that MDI is able to integrate a significant number of datasets simultaneously, and that it successfully captures the underlying structural similarity between the datasets. We also analyse a variety of real Saccharomyces cerevisiae datasets. In the two-dataset case, we show that MDI's performance is comparable with the present state-of-the-art. We then move beyond the capabilities of current approaches and integrate gene expression, chromatin immunoprecipitation-chip and protein-protein interaction data, to identify a set of protein complexes for which genes are co-regulated during the cell cycle. Comparisons to other unsupervised data integration techniques-as well as to non-integrative approaches-demonstrate that MDI is competitive, while also providing information that would be difficult or impossible to extract using other methods.} } @article{KirGriSavGhaetal12, cat = {bioinf np clust}, author = {P. Kirk and J. E. Griffin and R. S. Savage and Z. Ghahramani and D. L. Wild}, year = 2012, title = {{B}ayesian correlated clustering to integrate multiple datasets}, journal = {Bioinformatics}, abstract = {Motivation: The integration of multiple datasets remains a key challenge in systems biology and genomic medicine. Modern high-throughput technologies generate a broad array of different data types, providing distinct – but often complementary – information. We present a Bayesian method for the unsupervised integrative modelling of multiple datasets, which we refer to as MDI (Multiple Dataset Integration). MDI can integrate information from a wide range of different datasets and data types simultaneously (including the ability to model time series data explicitly using Gaussian processes). Each dataset is modelled using a Dirichlet-multinomial allocation (DMA) mixture model, with dependencies between these models captured via parameters that describe the agreement among the datasets.

Results: Using a set of 6 artificially constructed time series datasets, we show that MDI is able to integrate a significant number of datasets simultaneously, and that it successfully captures the underlying structural similarity between the datasets. We also analyse a variety of real S. cerevisiae datasets. In the 2-dataset case, we show that MDI’s performance is comparable to the present state of the art. We then move beyond the capabilities of current approaches and integrate gene expression, ChIP-chip and protein-protein interaction data, to identify a set of protein complexes for which genes are co-regulated during the cell cycle. Comparisons to other unsupervised data integration techniques – as well as to non-integrative approaches – demonstrate that MDI is very competitive, while also providing information that would be difficult or impossible to extract using other methods.}, annote = {This paper is available from the Bioinformatics site and a Matlab implementation of MDI is available fromthis site.}, url = {.} } @inproceedings{KnoGaeGha11, cat = {np approx}, author = {David A.~Knowles and Jurgen Van Gael and Zoubin Ghahramani}, title = {Message Passing Algorithms for the {D}irichlet Diffusion Tree}, booktitle = icml28, year = 2011, abstract = {We demonstrate efficient approximate inference for the Dirichlet Diffusion Tree (Neal, 2003), a Bayesian nonparametric prior over tree structures. Although DDTs provide a powerful and elegant approach for modeling hierarchies they haven't seen much use to date. One problem is the computational cost of MCMC inference. We provide the first deterministic approximate inference methods for DDT models and show excellent performance compared to the MCMC alternative. We present message passing algorithms to approximate the Bayesian model evidence for a specific tree. This is used to drive sequential tree building and greedy search to find optimal tree structures, corresponding to hierarchical clusterings of the data. We demonstrate appropriate observation models for continuous and binary data. The empirical performance of our method is very close to the computationally expensive MCMC alternative on a density estimation problem, and significantly outperforms kernel density estimators.}, url = {.}, annote = {web site} } @inproceedings{KnoGha07, cat = {np bioinf}, url = {.}, author = {David Knowles and Zoubin Ghahramani}, title = {Infinite Sparse Factor Analysis and Infinite Independent Components Analysis}, booktitle = {7th International Conference on Independent Component Analysis and Signal Separation}, year = 2007, month = {September}, address = {London, UK}, publisher = {Springer}, pages = {381--388}, doi = {10.1007/978-3-540-74494-8_48}, abstract = {A nonparametric Bayesian extension of Independent Components Analysis (ICA) is proposed where observed data Y is modelled as a linear superposition, G, of a potentially infinite number of hidden sources, X. Whether a given source is active for a specific data point is specified by an infinite binary matrix, Z. The resulting sparse representation allows increased data reduction compared to standard ICA. We define a prior on Z using the Indian Buffet Process (IBP). We describe four variants of the model, with Gaussian or Laplacian priors on X and the one or two-parameter IBPs. We demonstrate Bayesian inference under these models using a Markov chain Monte Carlo (MCMC) algorithm on synthetic and gene expression data and compare to standard ICA algorithms.} } @inproceedings{KnoGha11a, cat = {np clust}, author = {David A.~Knowles and Zoubin Ghahramani}, title = {Pitman-{Y}or Diffusion Trees}, booktitle = uai27, year = 2011, abstract = {We introduce the Pitman Yor Diffusion Tree (PYDT) for hierarchical clustering, a generalization of the Dirichlet Diffusion Tree (Neal, 2001) which removes the restriction to binary branching structure. The generative process is described and shown to result in an exchangeable distribution over data points. We prove some theoretical properties of the model and then present two inference methods: a collapsed MCMC sampler which allows us to model uncertainty over tree structures, and a computationally efficient greedy Bayesian EM search algorithm. Both algorithms use message passing on the tree structure. The utility of the model and algorithms is demonstrated on synthetic and real world data, both continuous and binary.}, url = {.}, annote = {web site} } @article{KnoGha11b, cat = {np bioinf}, author = {David A. Knowles and Zoubin Ghahramani}, year = 2011, title = {Nonparametric {B}ayesian Sparse Factor Models with application to Gene Expression modelling.}, journal = {Annals of Applied Statistics}, volume = 5, number = {2B}, pages = {1534--1552}, url = {.}, abstract = {A nonparametric Bayesian extension of Factor Analysis (FA) is proposed where observed data Y is modeled as a linear superposition, G, of a potentially infinite number of hidden factors, X. The Indian Buffet Process (IBP) is used as a prior on G to incorporate sparsity and to allow the number of latent features to be inferred. The model's utility for modeling gene expression data is investigated using randomly generated data sets based on a known sparse connectivity matrix for E. Coli, and on three biological data sets of increasing complexity.} } @inproceedings{KnoHol09, cat = {bioinf}, author = {David A. Knowles and Susan Holmes}, title = {Statistical tools for ultra-deep pyrosequencing of fast evolving viruses}, booktitle = {NIPS Workshop: Computational Biology}, year = 2009, abstract = {We aim to detect minor variant Hepatitis B viruses (HBV) in 38 pyrosequencing samples from infected individuals. Errors involved in the amplification and ultra deep pyrosequencing (UDPS) of these samples are characterised using HBV plasmid controls. Homopolymeric regions and quality scores are found to be significant covariates in determining insertion and deletion (indel) error rates, but not mismatch rates which depend on the nucleotide transition matrix. This knowledge is used to derive two methods for classifying genuine mutations: a hypothesis testing framework and a mixture model. Using an approximate "ground truth" from a limiting dilution Sanger sequencing run, these methods are shown to outperform the naive percentage threshold approach. The possibility of early stage PCR errors becoming significant is investigated by simulation, which underlines the importance of the initial copy number.}, url = {.}, annote = {web site} } @inproceedings{KnoMin11, cat = {np clust}, author = {David A.~Knowles and Thomas P.~Minka}, title = {Non-conjugate Variational Message Passing for Multinomial and Binary Regression}, booktitle = nips25, year = 2011, abstract = {Variational Message Passing (VMP) is an algorithmic implementation of the Variational Bayes (VB) method which applies only in the special case of conjugate exponential family models. We propose an extension to VMP, which we refer to as Non-conjugate Variational Message Passing (NCVMP) which aims to alleviate this restriction while maintaining modularity, allowing choice in how expectations are calculated, and integrating into an existing message-passing framework: Infer.NET. We demonstrate NCVMP on logistic binary and multinomial regression. In the multinomial case we introduce a novel variational bound for the softmax factor which is tighter than other commonly used bounds whilst maintaining computational tractability.}, url = {.}, annote = {web site supplementary} } @inproceedings{KnoParGlaWin10, cat = {bioinf}, author = {David A. Knowles and Leopold Parts and Daniel Glass and John M. Winn}, title = {Modeling skin and ageing phenotypes using latent variable models in Infer.NET}, booktitle = {NIPS Workshop: Predictive Models in Personalized Medicine Workshop}, year = 2010, abstract = {We demonstrate and compare three unsupervised Bayesian latent variable models implemented in Infer.NET for biomedical data modeling of 42 skin and ageing phenotypes measured on the 12,000 female twins in the Twins UK study. We address various data modeling problems include high missingness, heterogeneous data, and repeat observations. We compare the proposed models in terms of their performance at predicting disease labels and symptoms from available explanatory variables, concluding that factor analysis type models have the strongest statistical performance in this setting. We show that such models can be combined with regression components for improved interpretability.}, url = {.}, annote = {web site} } @inproceedings{KnoParGlaWin11, cat = {bioinf}, author = {David A. Knowles and Leopold Parts and Daniel Glass and John M. Winn}, title = {Inferring a measure of physiological age from multiple ageing related phenotypes}, booktitle = {NIPS Workshop: From Statistical Genetics to Predictive Models in Personalized Medicine}, year = 2011, abstract = {What is ageing? One definition is simultaneous degradation of multiple organ systems. Can an individual be said to be "old" or "young" for their (chronological) age in a scientifically meaningful way? We investigate these questions using ageing related phenotypes measured on the 12,000 female twins in the Twins UK study. We propose a simple linear model of ageing, which allows a latent adjustment to be made to an individual's chronological age to give her "physiological age", shared across the observed phenotypes. We note problems with the analysis resulting from the linearity assumption and show how to alleviate these issues using a non-linear extension. We find more gene expression probes are significantly associated with our measurement of physiological age than to chronological age. }, url = {.}, annote = {web site} } @inproceedings{KocBanLiketal03, cat = {gp}, booktitle = {IFAC Internaltional Conference on Intelligent Control Systems and Signal Processing}, title = {A case based comparison of identification with neural network and {G}aussian process models}, author = {Ju{\v s} Kocijan and Bla{\v z} Banko and Bojan Likar and Agathe Girard and Roderick Murray-Smith and Carl Edward Rasmussen}, year = 2003, volume = 1, pages = {137--142}, url = {.}, abstract = {In this paper an alternative approach to black-box identification of non-linear dynamic systems is compared with the more established approach of using artificial neural networks. The Gaussian process prior approach is a representative of non-parametric modelling approaches. It was compared on a pH process modelling case study. The purpose of modelling was to use the model for control design. The comparison revealed that even though Gaussian process models can be effectively used for modelling dynamic systems caution has to be axercised when signals are selected.} } @inproceedings{KocMurRasGir04, cat = {gp rl}, author = {Ju{\v s} Kocijan and Roderick Murray-Smith and Carl Edward Rasmussen and Agathe Girard}, title = {Gaussian process model based predictive control}, year = 2004, pages = {2214--2219}, journal = {Proceedings of the ACC 2004}, abstract = {Gaussian process models provide a probabilistic non-parametric modelling approach for black-box identi cation of non-linear dynamic systems. The Gaussian processes can highlight areas of the input space where prediction quality is poor, due to the lack of data or its complexity, by indicating the higher variance around the predicted mean. Gaussian process models contain noticeably less coef cients to be optimised. This paper illustrates possible application of Gaussian process models within model-based predictive control. The extra information provided within Gaussian process model is used in predictive control, where optimisation of control signal takes the variance information into account. The predictive control principle is demonstrated on control of pH process benchmark.}, booktitle = {American Control Conference}, location = {Boston, MA}, url = {.} } @inproceedings{KocMurRasLik03, cat = {gp rl}, title = {Predictive control with {G}aussian process models}, author = {Ju{\v s} Kocijan and Roderick Murray-Smith and Carl Edward Rasmussen and Bojan Likar}, editor = {B.~Zajc and M.~Tkal}, pages = {352--356}, url = {.}, year = 2003, booktitle = {IEEE Region 8 Eurocon 2003: Computer as a Tool}, abstract = {This paper describes model-based predictive control based on Gaussian processes. Gaussian process models provide a probabilistic non-parametric modelling approach for black-box identification of non-linear dynamic systems. It offers more insight in variance of obtained model response, as well as fewer parameters to determine than other models. The Gaussian processes can highlight areas of the input space where prediction quality is poor, due to the lack of data or its complexity, by indicating the higher variance around the predicted mean. This property is used in predictive control, where optimisation of control signal takes the variance information into account. The predictive control principle is demonstrated on a simulated example of nonlinear system.} } @article{KokHolSch17, cat = {sigproc review}, author = {Manon Kok and Jeroen D. Hol and Thomas B. Sch\"on}, title = {Using Inertial Sensors for Position and Orientation Estimation}, year = 2017, journal = {Foundations and Trends in Signal Processing (Accepted for publication)}, url = {https://arxiv.org/abs/1704.06053}, abstract = {In recent years, MEMS inertial sensors (3D accelerometers and 3D gyroscopes) have become widely available due to their small size and low cost. Inertial sensor measurements are obtained at high sampling rates and can be integrated to obtain position and orientation information. These estimates are accurate on a short time scale, but suffer from integration drift over longer time scales. To overcome this issue, inertial sensors are typically combined with additional sensors and models. In this tutorial we focus on the signal processing aspects of position and orientation estimation using inertial sensors. We discuss different modeling choices and a selected number of important algorithms. The algorithms include optimization-based smoothing and filtering as well as computationally cheaper extended Kalman filter and complementary filter implementations. The quality of their estimates is illustrated using both experimental and simulated data.} } @inproceedings{KorCheWel14, author = {Anoop Korattikara and Yutian Chen and Max Welling}, cat = {mcmc approx}, title = {Austerity in {MCMC} Land: Cutting the {M}etropolis-{H}astings Budget}, booktitle = icml31, pages = {181-189}, year = 2014, month = {June}, address = {Beijing, China}, url = {http://jmlr.org/proceedings/papers/v32/korattikara14.pdf}, annote = {supplementary}, abstract = {Can we make Bayesian posterior MCMC sampling more efficient when faced with very large datasets? We argue that computing the likelihood for N datapoints in the Metropolis-Hastings (MH) test to reach a single binary decision is computationally inefficient. We introduce an approximate MH rule based on a sequential hypothesis test that allows us to accept or reject samples with high confidence using only a fraction of the data required for the exact MH rule. While this method introduces an asymptotic bias, we show that this bias can be controlled and is more than offset by a decrease in variance due to our ability to draw more samples per unit of time. } } @article{KraStrRadetal13, cat = {active}, title = {Experimental Adaptive {B}ayesian Tomography}, author = {Konstantin Kravtsov and Stanislav Straupe and Igor Radchenko and Neil Houlsby and Ferenc Husz{\'a}r and Sergey Kulik}, journal = {Physical Review A}, volume = 87, number = 6, pages = 062122, year = 2013, publisher = {APS}, url = {http://link.aps.org/doi/10.1103/PhysRevA.87.062122}, abstract = {We report an experimental realization of an adaptive quantum state tomography protocol. Our method takes advantage of a Bayesian approach to statistical inference and is naturally tailored for adaptive strategies. For pure states we observe close to $N^{-1}$ scaling of infidelity with overall number of registered events, while best non-adaptive protocols allow for $N^{-1/2}$ scaling only. Experiments are performed for polarization qubits, but the approach is readily adapted to any dimension.} } @techreport{KusPfiCsaRas05, cat = {gp}, author = {Malte Ku{\ss} and Tobias Pfingsten and Lehel Csat{\`o} and Carl Edward Rasmussen}, title = {Approximate Inference for Robust {G}aussian Process Regression}, year = 2005, institution = {Max Planck Institute for Biological Cybernetics}, number = 136, address = {T{\"u}bingen, Germany}, abstract = {Gaussian process (GP) priors have been successfully used in non-parametric Bayesian regression and classification models. Inference can be performed analytically only for the regression model with Gaussian noise. For all other likelihood models inference is intractable and various approximation techniques have been proposed. In recent years expectation-propagation (EP) has been developed as a general method for approximate inference. This article provides a general summary of how expectation-propagation can be used for approximate inference in Gaussian process models. Furthermore we present a case study describing its implementation for a new robust variant of Gaussian process regression. To gain further insights into the quality of the EP approximation we present experiments in which we compare to results obtained by Markov chain Monte Carlo (MCMC) sampling.}, url = {.} } @article{KusRas05, cat = {gp approx}, author = {Malte {Ku\ss} and Carl Edward Rasmussen}, title = {Assessing Approximate Inference for Binary {G}aussian Process Classification}, journal = jmlr, year = 2005, volume = 6, pages = {1679--1704}, url = {http://www.jmlr.org/papers/volume6/kuss05a/kuss05a.pdf}, abstract = {Gaussian process priors can be used to define flexible, probabilistic classification models. Unfortunately exact Bayesian inference is analytically intractable and various approximation techniques have been proposed. In this work we review and compare Laplace's method and Expectation Propagation for approximate Bayesian inference in the binary Gaussian process classification model. We present a comprehensive comparison of the approximations, their predictive performance and marginal likelihood estimates to results obtained by MCMC sampling. We explain theoretically and corroborate empirically the advantages of Expectation Propagation compared to Laplace's method.} } @inproceedings{KusRas06, cat = {gp rl}, author = {Malte Ku{\ss} and Carl Edward Rasmussen}, title = {Assessing Approximations for {G}aussian Process Classification}, year = 2006, publisher = mit, pages = {699--706}, month = {April}, booktitle = nips18, editor = {Y.~Weiss and B.~Sch{\"o}lkopf and J.~Platt}, address = {Cambridge, MA, USA}, abstract = {Gaussian processes are attractive models for probabilistic classification but unfortunately exact inference is analytically intractable. We compare Laplace's method and Expectation Propagation (EP) focusing on marginal likelihood estimates and predictive performance. We explain theoretically and corroborate empirically that EP is superior to Laplace. We also compare to a sophisticated MCMC scheme and show that EP is surprisingly accurate.}, location = {Whistler, BC, Canada}, URL = {.} } @inproceedings{LacHusGha11, cat = {approx gp}, url = {.}, author = {Simon Lacoste-Julien and Ferenc Husz\'{a}r and Zoubin Ghahramani}, title = {Approximate Inference for the Loss-Calibrated {B}ayesian}, booktitle = aistats14, year = 2011, editor = {Geoff Gordon and David Dunson}, volume = 15, pages = {416--424}, address = {Fort Lauderdale, FL, USA}, month = {April}, publisher = jmlr, abstract = {We consider the problem of approximate inference in the context of Bayesian decision theory. Traditional approaches focus on approximating general properties of the posterior, ignoring the decision task -- and associated losses -- for which the posterior could be used. We argue that this can be suboptimal and propose instead to \emph{loss-calibrate} the approximate inference methods with respect to the decision task at hand. We present a general framework rooted in Bayesian decision theory to analyze approximate inference from the perspective of losses, opening up several research directions. As a first loss-calibrated approximate inference attempt, we propose an EM-like algorithm on the Bayesian posterior risk and show how it can improve a standard approach to Gaussian process classification when losses are asymmetric.} } @inproceedings{LacPalDav13a, author = {Lacoste-Julien, Simon and Palla, Konstantina and Davies, Alex and Kasneci, Gjergji and Graepel, Thore and Ghahramani, Zoubin}, booktitle = {KDD}, cat = {ir}, isbn = {978-1-4503-2174-7}, pages = {572-580}, publisher = acm, title = {SIGMa: simple greedy matching for aligning large knowledge bases}, url = {.}, year = 2013, abstract = {The Internet has enabled the creation of a growing number of large-scale knowledge bases in a variety of domains containing complementary information. Tools for automatically aligning these knowledge bases would make it possible to unify many sources of structured knowledge and answer complex queries. However, the efficient alignment of large-scale knowledge bases still poses a considerable challenge. Here, we present Simple Greedy Matching (SiGMa), a simple algorithm for aligning knowledge bases with millions of entities and facts. SiGMa is an iterative propagation algorithm which leverages both the structural information from the relationship graph as well as flexible similarity measures between entity properties in a greedy local search, thus making it scalable. Despite its greedy nature, our experiments indicate that SiGMa can efficiently match some of the world's largest knowledge bases with high precision. We provide additional experiments on benchmark datasets which demonstrate that SiGMa can outperform state-of-the-art approaches both in accuracy and efficiency. } } @article{LazQuiRasFig10, cat = {gp}, author = {Miguel L\'azaro-Gredilla and Joaquin Qui{\~n}onero-Candela and Carl Edward Rasmussen and An\'{i}bal Figueiras-Vidal}, title = {Sparse Spectrum {G}aussian Process Regression}, journal = jmlr, url = {http://jmlr.csail.mit.edu/papers/volume11/lazaro-gredilla10a/lazaro-gredilla10a.pdf}, volume = 11, pages = {1865--1881}, month = {June}, year = 2010, abstract = {We present a new sparse Gaussian Process (GP) model for regression. The key novel idea is to sparsify the \emph{spectral representation} of the GP. This leads to a simple, practical algorithm for regression tasks. We compare the achievable trade-offs between predictive accuracy and computational requirements, and show that these are typically superior to existing state-of-the-art sparse approximations. We discuss both the weight space and function space representations, and note that the new construction implies priors over functions which are always stationary, and can approximate any covariance function in this class.} } @article{LeoStoTurGos2014, cat = {neuro}, author = {Victoria Leong and Michael A Stone and Richard E Turner and Usha Goswami}, title = {A role for amplitude modulation phase relationships in speech rhythm perception}, journal = {Journal of the Acoustical Society of America}, year = 2014, volume = 136, pages = {366--381}, publisher = {Acoustical Society of America}, abstract = {Prosodic rhythm in speech [the alternation of "Strong" (S) and "weak" (w) syllables] is cued, among others, by slow rates of amplitude modulation (AM) within the speech envelope. However, it is unclear exactly which envelope modulation rates and statistics are the most important for the rhythm percept. Here, the hypothesis that the phase relationship between "Stress" rate (~2 Hz) and "Syllable" rate (~4 Hz) AMs provides a perceptual cue for speech rhythm is tested. In a rhythm judgment task, adult listeners identified AM tone-vocoded nursery rhyme sentences that carried either trochaic (S-w) or iambic patterning (w-S). Manipulation of listeners' rhythm perception was attempted by parametrically phase-shifting the Stress AM and Syllable AM in the vocoder. It was expected that a 1π radian phase-shift (half a cycle) would reverse the perceived rhythm pattern (i.e., trochaic -> iambic) whereas a 2\pi radian shift (full cycle) would retain the perceived rhythm pattern (i.e., trochaic -> trochaic). The results confirmed these predictions. Listeners judgments of rhythm systematically followed Stress-Syllable AM phase-shifts, but were unaffected by phase-shifts between the Syllable AM and the Sub-beat AM (~14 Hz) in a control condition. It is concluded that the Stress-Syllable AM phase relationship is an envelope-based modulation statistic that supports speech rhythm perception.}, url = {http://scitation.aip.org/content/asa/journal/jasa/136/1/10.1121/1.4883366}, } @article{LesChaKleetal10, author = {J. Leskovec and D. Chakrabarti and J. Kleinberg and C. Faloutsos and Z. Ghahramani}, year = 2010, title = {Kronecker Graphs: An Approach to Modeling Networks}, journal = jmlr, volume = {11(Feb)}, pages = {985--1042}, url = {.}, abstract = {How can we generate realistic networks? In addition, how can we do so with a mathematically tractable model that allows for rigorous analysis of network properties? Real networks exhibit a long list of surprising properties: Heavy tails for the in- and out-degree distribution, heavy tails for the eigenvalues and eigenvectors, small diameters, and densification and shrinking diameters over time. Current network models and generators either fail to match several of the above properties, are complicated to analyze mathematically, or both. Here we propose a generative model for networks that is both mathematically tractable and can generate networks that have all the above mentioned structural properties. Our main idea here is to use a non-standard matrix operation, the Kronecker product, to generate graphs which we refer to as "Kronecker graphs".

First, we show that Kronecker graphs naturally obey common network properties. In fact, we rigorously prove that they do so. We also provide empirical evidence showing that Kronecker graphs can effectively model the structure of real networks.

We then present KRONFIT, a fast and scalable algorithm for fitting the Kronecker graph generation model to large real networks. A naive approach to fitting would take super-exponential time. In contrast, KRONFIT takes linear time, by exploiting the structure of Kronecker matrix multiplication and by using statistical simulation techniques. Experiments on a wide range of large real and synthetic networks show that KRONFIT finds accurate parameters that very well mimic the properties of target networks. In fact, using just four parameters we can accurately model several aspects of global network structure. Once fitted, the model parameters can be used to gain insights about the network structure, and the resulting synthetic graphs can be used for null-models, anonymization, extrapolations, and graph summarization.} } @inproceedings{LiHerTur15, cat = {approx deep}, title = {{S}tochastic {E}xpectation {P}ropagation}, author = {Yingzhen Li and Jos\'e Miguel Hern\'andez-Lobato and Richard E.~Turner}, booktitle = nips28, year = 2015, month = {Dec}, address = {Montréal CANADA}, url = {http://arxiv.org/abs/1506.04132}, abstract = {Expectation propagation (EP) is a deterministic approximation algorithm that is often used to perform approximate Bayesian parameter learning. EP approximates the full intractable posterior distribution through a set of local-approximations that are iteratively refined for each datapoint. EP can offer analytic and computational advantages over other approximations, such as Variational Inference (VI), and is the method of choice for a number of models. The local nature of EP appears to make it an ideal candidate for performing Bayesian learning on large models in large-scale datasets settings. However, EP has a crucial limitation in this context: the number approximating factors needs to increase with the number of data-points, N, which often entails a prohibitively large memory overhead. This paper presents an extension to EP, called stochastic expectation propagation (SEP), that maintains a global posterior approximation (like VI) but updates it in a local way (like EP ). Experiments on a number of canonical learning problems using synthetic and real-world datasets indicate that SEP performs almost as well as full EP, but reduces the memory consumption by a factor of N. SEP is therefore ideally suited to performing approximate Bayesian learning in the large model, large dataset setting.} } @inproceedings{LiTur16, cat = {approx deep}, title = {{R}\'enyi {D}ivergence {V}ariational {I}nference}, author = {Yingzhen Li and Richard E.~Turner}, booktitle = nips29, year = 2016, month = {Dec}, address = {Barcelona SPAIN}, url = {https://papers.nips.cc/paper/6208-renyi-divergence-variational-inference}, abstract = {This paper introduces the variational R{\'e}nyi bound (VR) that extends traditional variational inference to R{\'e}nyi's alpha-divergences. This new family of variational methods unifies a number of existing approaches, and enables a smooth interpolation from the evidence lower-bound to the log (marginal) likelihood that is controlled by the value of alpha that parametrises the divergence. The reparameterization trick, Monte Carlo approximation and stochastic optimisation methods are deployed to obtain a tractable and unified framework for optimisation. We further consider negative alpha values and propose a novel variational inference method as a new special case in the proposed framework. Experiments on Bayesian neural networks and variational auto-encoders demonstrate the wide applicability of the VR bound.} } @inproceedings{LiGal17, cat = {approx deep}, title = {{D}ropout {I}nference in {B}ayesian {N}eural {N}etworks with {A}lpha-divergences}, author = {Yingzhen Li and Yarin Gal}, booktitle = icml34, year = 2017, month = {Aug}, address = {Sydney AUSTRALIA}, url = {http://proceedings.mlr.press/v70/li17a.html}, abstract = {To obtain uncertainty estimates with real-world Bayesian deep learning models, practical inference approximations are needed. Dropout variational inference (VI) for example has been used for machine vision and medical applications, but VI can severely underestimates model uncertainty. Alpha-divergences are alternative divergences to VI’s KL objective, which are able to avoid VI’s uncertainty underestimation. But these are hard to use in practice: existing techniques can only use Gaussian approximating distributions, and require existing models to be changed radically, thus are of limited use for practitioners. We propose a re-parametrisation of the alpha-divergence objectives, deriving a simple inference technique which, together with dropout, can be easily implemented with existing models by simply changing the loss of the model. We demonstrate improved uncertainty estimates and accuracy compared to VI in dropout networks. We study our model’s epistemic uncertainty far away from the data using adversarial images, showing that these can be distinguished from non-adversarial images by examining our model’s uncertainty. } } @inproceedings{LimCalMac17, cat = {np rl}, author = {Daniel Limon and Jan-Peter Calliess and Jan Maciejowski}, title = {Learning-based Nonlinear Model Predictive Control}, booktitle = {IFAC 2017 World Congress}, year = 2017, address = {Toulouse, France}, month = {July}, abstract = {This paper presents stabilizing Model Predictive Controllers (MPC) in which prediction models are inferred from experimental data of the inputs and outputs of the plant. Using a nonparametric machine learning technique called LACKI, the estimated (possibly nonlinear) model function together with an estimation of Hoelder constant is provided. Based on these, a number of predictive controllers with stability guaranteed by design are proposed. Firstly, the case when the prediction model is estimated off- line is considered and robust stability and recursive feasibility is ensured by using tightened constraints in the optimisation problem. This controller has been extended to the more interesting and complex case: the online learning of the model, where the new data collected from feedback is added to enhance the prediction model. A on-line learning MPC based on a double sequence of predictions is proposed. Stability of the online learning MPC is proved. These controllers are illustrated by simulation. } } @article{LipGhaBor10, cat = {bioinf}, author = {C. Lippert and Z. Ghahramani and K. Borgwardt}, title = {Gene function prediction from synthetic lethality networks via ranking on demand}, journal = {Bioinformatics}, year = 2010, volume = 26, pages = {912--918}, url = {.}, abstract = {Motivation: Synthetic lethal interactions represent pairs of genes whose individual mutations are not lethal, while the double mutation of both genes does incur lethality. Several studies have shown a correlation between functional similarity of genes and their distances in networks based on synthetic lethal interactions. However, there is a lack of algorithms for predicting gene function from synthetic lethality interaction networks.

Results: In this article, we present a novel technique called kernelROD for gene function prediction from synthetic lethal interaction networks based on kernel machines. We apply our novel algorithm to Gene Ontology functional annotation prediction in yeast. Our experiments show that our method leads to improved gene function prediction compared with state-of-the-art competitors and that combining genetic and congruence networks leads to a further improvement in prediction accuracy.} } @inproceedings{LipSteGhaetal09, cat = {bioinf}, volume = 5, author = {C. Lippert and O. Stegle and Z. Ghahramani and K. Borgwardt}, note = {ISSN: 1938-7228}, booktitle = aistats12, editor = {D. van Dyk and M. Welling}, title = {A kernel method for unsupervised structured network inference}, publisher = jmlr, year = 2009, month = {April}, address = {Clearwater Beach, FL, USA}, pages = {368--375}, url = {.}, abstract = {Network inference is the problem of inferring edges between a set of real-world objects, for instance, interactions between pairs of proteins in bioinformatics. Current kernel-based approaches to this problem share a set of common features: (i) they are supervised and hence require labeled training data; (ii) edges in the network are treated as mutually independent and hence topological properties are largely ignored; (iii) they lack a statistical interpretation. We argue that these common assumptions are often undesirable for network inference, and propose (i) an unsupervised kernel method (ii) that takes the global structure of the network into account and (iii) is statistically motivated. We show that our approach can explain commonly used heuristics in statistical terms. In experiments on social networks, dfferent variants of our method demonstrate appealing predictive performance.} } @article{Llo13, cat = {gp}, title = {GEFCom2012 Hierarchical Load Forecasting: Gradient Boosting Machines and Gaussian Processes}, author = {James Robert Lloyd}, year = 2013, abstract = {This report discusses methods for forecasting hourly loads of a US utility as part of the load forecasting track of the Global Energy Forecasting Competition 2012 hosted on Kaggle. The methods described (gradient boosting machines and Gaussian processes) are generic machine learning / regression algorithms and few domain specific adjustments were made. Despite this, the algorithms were able to produce highly competitive predictions and hopefully they can inspire more reﬁned techniques to compete with state-of-the-art load forecasting methodologies.}, url = {http://mlg.eng.cam.ac.uk/lloyd/papers/GEFCom2012.pdf}, journal = {International Journal of Forecasting} } @phdthesis{Llo15, author = {James Rovert Lloyd}, cat = {gp time np network}, title = {Representation, learning, description and criticism of probabilistic models with applications to networks, functions and relational data}, school = {University of Cambridge, Department of Engineering}, year = 2015, abstract = {This thesis makes contributions to a variety of aspects of probabilistic inference. When performing probabilistic inference, one must first represent one’s beliefs with a probability distribution. Specifying the details of a probability distribution can be a difficult task in many situations, but when expressing beliefs about complex data structures it may not even be apparent what form such a distribution should take. This thesis starts by demonstrating how representation theorems due to Aldous, Hoover and Kallenberg can be used to specify appropriate models for data in the form of networks. These theorems are then extended in order to reveal appropriate probability distributions for arbitrary relational data or databases. A simpler data structure to specify probability distributions for is that of functions; many probability distributions for functions have been used for centuries. We demonstrate that many of these distributions can be expressed in a common language of Gaussian process kernels constructed from a few base elements and operators. The structure of this language allows for the effective automatic construction of probabilistic models for functions. Furthermore, the formal mathematical language of kernels can be mapped neatly onto natural language allowing for automatic descriptions of the automatically constructed models. By further automating the construction of statistical models, the need to be able to effectively check or criticise these models becomes greater. This thesis demonstrates how kernel two sample tests can be used to demonstrate where a probabilistic model most disagrees with data allowing for targeted improvements to the model. In proposing a new method of model criticism this thesis also briefly discusses the philosophy of model criticism within the context of probabilistic inference.}, address = {Cambridge, UK}, url = {https://github.com/jamesrobertlloyd/phd-thesis/raw/master/thesis-final.pdf} } @inproceedings{LloDuvGroetal14, cat = {gp np time}, author = {James Robert Lloyd and David Duvenaud and Roger Grosse and Joshua B. Tenenbaum and Zoubin Ghahramani}, title = {Automatic Construction and Natural-Language Description of Nonparametric Regression Models}, booktitle = {Association for the Advancement of Artificial Intelligence (AAAI)}, abstract = {This paper presents the beginnings of an automatic statistician, focusing on regression problems. Our system explores an open-ended space of statistical models to discover a good explanation of a data set, and then produces a detailed report with figures and natural-language text. Our approach treats unknown regression functions nonparametrically using Gaussian processes, which has two important consequences. First, Gaussian processes can model functions in terms of high-level properties (e.g. smoothness, trends, periodicity, changepoints). Taken together with the compositional structure of our language of models this allows us to automatically describe functions in simple terms. Second, the use of flexible nonparametric models and a rich language for composing them in an open-ended manner also results in state-of-the-art extrapolation performance evaluated over 13 real time series data sets from various domains.}, month = {July}, year = 2014, url = {http://arxiv.org/pdf/1402.4304.pdf} } @inproceedings{LloGha15, cat = {gp crit np}, title = {Statistical Model Criticism using Kernel Two Sample Tests}, author = {James Robert Lloyd and Zoubin Ghahramani}, abstract = {We propose an exploratory approach to statistical model criticism using maximum mean discrepancy (MMD) two sample tests. Typical approaches to model criticism require a practitioner to select a statistic by which to measure discrepancies between data and a statistical model. MMD two sample tests are instead constructed as an analytic maximisation over a large space of possible statistics and therefore automatically select the statistic which most shows any discrepancy. We demonstrate on synthetic data that the selected statistic, called the witness function, can be used to identify where a statistical model most misrepresents the data it was trained on. We then apply the procedure to real data where the models being assessed are restricted Boltzmann machines, deep belief networks and Gaussian process regression and demonstrate the ways in which these models fail to capture the properties of the data they are trained on.}, url = {http://jamesrobertlloyd.com/papers/kernel-model-checking.pdf}, booktitle = nips29, year = 2015, address = {Montreal, Canada}, month = {December}, pages = {1--9} } @inproceedings{LloOrbGhaRoy12, cat = {gp np network}, title = {Random function priors for exchangeable arrays with applications to graphs and relational data}, author = {James Robert Lloyd and Peter Orbanz and Zoubin Ghahramani and Daniel M. Roy}, abstract = {A fundamental problem in the analysis of structured relational data like graphs, networks, databases, and matrices is to extract a summary of the common structure underlying relations between individual entities. Relational data are typically encoded in the form of arrays; invariance to the ordering of rows and columns corresponds to exchangeable arrays. Results in probability theory due to Aldous, Hoover and Kallenberg show that exchangeable arrays can be represented in terms of a random measurable function which constitutes the natural model parameter in a Bayesian model. We obtain a flexible yet simple Bayesian nonparametric model by placing a Gaussian process prior on the parameter function. Efficient inference utilises elliptical slice sampling combined with a random sparse approximation to the Gaussian process. We demonstrate applications of the model to network data and clarify its relation to models in the literature, several of which emerge as special cases. }, url = {http://books.nips.cc/papers/files/nips25/NIPS2012_0487.pdf}, booktitle = nips26, year = 2012, address = {Lake Tahoe, California, USA}, month = {December}, pages = {1--9} } @inproceedings{LopHenSco13, cat = {sigproc}, booktitle = nips27, title = {The Randomized Dependence Coefficient}, author = {David Lopez-Paz and Philipp Hennig and Bernhard Scholk\"{o}pf}, year = 2013, address = {Lake Tahoe, California, USA}, month = {December}, pages = {1--9}, url = {http://arxiv.org/pdf/1304.7717}, abstract = {We introduce the {Randomized Dependence Coefficient (RDC)}, a measure of non-linear dependence between random variables of arbitrary dimension based on the {Hirschfeld-Gebelein-R\'enyi Maximum Correlation Coefficient}. {RDC} is defined in terms of correlation of random non-linear copula projections; it is invariant with respect to marginal distribution transformations, has low computational cost and is easy to implement: just five lines of {R} code, included at the end of the paper.} } @inproceedings{LopHerGha13, cat = {gp np approx}, title = {Gaussian Process Vine Copulas for Multivariate Dependence}, author = {David Lopez-Paz and Jos\'e Miguel Hern\'andez-Lobato and Zoubin Ghahramani}, booktitle = icml30, year = 2013, month = {June}, address = {Atlanta, Georgia, USA}, url = {http://jmlr.org/proceedings/papers/v28/lopez-paz13.pdf}, abstract = {Copulas allow to learn marginal distributions separately from the multivariate dependence structure (copula) that links them together into a density function. Vine factorizations ease the learning of high-dimensional copulas by constructing a hierarchy of conditional bivariate copulas. However, to simplify inference, it is common to assume that each of these conditional bivariate copulas is independent from its conditioning variables. In this paper, we relax this assumption by discovering the latent functions that specify the shape of a conditional copula given its conditioning variables We learn these functions by following a Bayesian approach based on sparse Gaussian processes with expectation propagation for scalable, approximate inference. Experiments on real-world datasets show that, when modeling all conditional dependencies, we obtain better estimates of the underlying copula of the data.} } @inproceedings{LopHerSco12, cat = {ssl}, booktitle = nips26, title = {Semi-Supervised Domain Adaptation with Non-Parametric Copulas}, author = {David Lopez-Paz and Jos\'e Miguel Hern\'andez-Lobato and Bernhard Scholk\"{o}pf}, year = 2012, address = {Lake Tahoe, California, USA}, month = {December}, pages = {1--9}, url = {http://books.nips.cc/papers/files/nips25/NIPS2012_0307.pdf}, abstract = {A new framework based on the theory of copulas is proposed to address semisupervised domain adaptation problems. The presented method factorizes any multivariate density into a product of marginal distributions and bivariate copula functions. Therefore, changes in each of these factors can be detected and corrected to adapt a density model accross different learning domains. Importantly, we introduce a novel vine copula model, which allows for this factorization in a non-parametric manner. Experimental results on regression problems with real-world data illustrate the efficacy of the proposed approach when compared to state-of-the-art techniques.} } @inproceedings{LopSraSmo14a, author = {Lopez-Paz, David and Sra, Suvrit and Smola, Alex J. and Ghahramani, Zoubin and Sch{\"o}lkopf, Bernhard}, booktitle = {ICML}, publisher = {JMLR.org}, series = {JMLR Proceedings}, title = {Randomized Nonlinear Component Analysis}, url = {.}, volume = 29, year = 2014, abstract = {Classical techniques such as Principal Component Analysis (PCA) and Canonical Correlation Analysis (CCA) are ubiquitous in statistics. However, these techniques only reveal linear relationships in data. Although nonlinear variants of PCA and CCA have been proposed, they are computationally prohibitive in the large scale. In a separate strand of recent research, randomized methods have been proposed to construct features that help reveal nonlinear patterns in data. For basic tasks such as regression or classification, random features exhibit little or no loss in performance, while achieving dramatic savings in computational requirements. In this paper we leverage randomness to design scalable new variants of nonlinear PCA and CCA; our ideas also extend to key multivariate analysis tools such as spectral clustering or LDA. We demonstrate our algorithms through experiments on real-world data, on which we compare against the state-of-the-art. Code in R implementing our methods is provided in the Appendix.} } @incollection{LucTurSahHen09a, cat = {approx neuro mvision}, title = {Occlusive Components Analysis}, author = {J\"{o}rg L\"{u}cke and Richard E. Turner and Maneesh Sahani and Marc Henniges}, booktitle = {nips22}, publisher = {mit}, editor = {Y Bengio and D Schuurmans and J Lafferty and C K I Williams and A Culotta}, pages = {1069--1077}, year = 2009, abstract = {We study unsupervised learning in a probabilistic generative model for occlusion. The model uses two types of latent variables: one indicates which objects are present in the image, and the other how they are ordered in depth. This depth order then determines how the positions and appearances of the objects present, specified in the model parameters, combine to form the image. We show that the object parameters can be learnt from an unlabelled set of images in which objects occlude one another. Exact maximum-likelihood learning is intractable. However, we show that tractable approximations to Expectation Maximization (EM) can be found if the training images each contain only a small number of objects on average. In numerical experiments it is shown that these approximations recover the correct set of object parameters. Experiments on a novel version of the bars test using colored bars, and experiments on more realistic data, show that the algorithm performs well in extracting the generating causes. Experiments based on the standard bars benchmark test for object learning show that the algorithm performs well in comparison to other recent component extraction approaches. The model and the learning algorithm thus connect research on occlusion with the research field of multiple-causes component extraction methods.}, url = {http://www.gatsby.ucl.ac.uk/~turner/Publications/lucke-et-al-2009.html}, } @inproceedings{MacBusCunetal11, cat = {time neuro}, booktitle = nips25, title = {Empirical models of spiking in neural populations}, author = {J. H. Macke and L. Busing and J. P. Cunningham and B. M. Yu and K. V. Shenoy and M. Sahani}, year = 2011, address = {Granada, Spain}, month = {December}, pages = {1--8}, url = {.}, abstract = {Neurons in the neocortex code and compute as part of a locally interconnected population. Large-scale multi-electrode recording makes it possible to access these population processes empirically by fitting statistical models to unaveraged data. What statistical structure best describes the concurrent spiking of cells within a local network? We argue that in the cortex, where firing exhibits extensive correlations in both time and space and where a typical sample of neurons still reflects only a very small fraction of the local population, the most appropriate model captures shared variability by a low-dimensional latent process evolving with smooth dynamics, rather than by putative direct coupling. We test this claim by comparing a latent dynamical model with realistic spiking observations to coupled generalised linear spike-response models (GLMs) using cortical recordings. We find that the latent dynamical approach outperforms the GLM in terms of goodness-of- fit, and reproduces the temporal correlations in the data more accurately. We also compare models whose observations models are either derived from a Gaussian or point-process models, finding that the non-Gaussian model provides slightly better goodness-of-fit and more realistic population spike counts.} } @article{MatGha14, cat = {ssl np}, title = {{C}lassification using log {G}aussian {C}ox processes}, author = {Matthews, Alexander G. D. G. and Ghahramani, Zoubin }, year = 2014, journal = "arXiv preprint arXiv:1405.4141", url = {http://arxiv.org/abs/1405.4141}, abstract = {{M}c{C}ullagh and {Y}ang (2006) suggest a family of classification algorithms based on Cox processes. We further investigate the log Gaussian variant which has a number of appealing properties. Conditioned on the covariates, the distribution over labels is given by a type of conditional Markov random field. In the supervised case, computation of the predictive probability of a single test point scales linearly with the number of training points and the multiclass generalization is straightforward. We show new links between the supervised method and classical nonparametric methods. We give a detailed analysis of the pairwise graph representable Markov random field, which we use to extend the model to semi-supervised learning problems, and propose an inference method based on graph min-cuts. We give the first experimental analysis on supervised and semi-supervised datasets and show good empirical performance.} } @inproceedings{MatHenGha14, cat = {approx}, title = {Comparing lower bounds on the entropy of mixture distributions for use in variational inference}, author = {Matthews, Alexander G. D. G and Hensman, James and Ghahramani, Zoubin}, address = {Montreal, Canada}, booktitle = {{NIPS} workshop on {A}dvances in {V}ariational {I}nference}, url = {https://sites.google.com/site/variationalworkshop/accepted-papers}, year = 2014, month = {December} } @inproceedings{MatHenTurGha15, cat = {gp np approx}, title = {On {S}parse {V}ariational methods and the {K}ullback-{L}eibler divergence between stochastic processes}, author = {Alexander G D G Matthews and James Hensman and Richard E. Turner and Zoubin Ghahramani}, booktitle = aistats19, year = 2016, month = {May}, address = {Cadiz, Spain}, url = {http://arxiv.org/abs/1504.07027}, abstract = {The variational framework for learning inducing variables (Titsias, 2009a) has had a large impact on the Gaussian process literature. The framework may be interpreted as minimizing a rigorously defined Kullback-Leibler divergence between the approximating and posterior processes. To our knowledge this connection has thus far gone unremarked in the literature. In this paper we give a substantial generalization of the literature on this topic. We give a new proof of the result for infinite index sets which allows inducing points that are not data points and likelihoods that depend on all function values. We then discuss augmented index sets and show that, contrary to previous works, marginal consistency of augmentation is not enough to guarantee consistency of variational inference with the original model. We then characterize an extra condition where such a guarantee is obtainable. Finally we show how our framework sheds light on interdomain sparse approximations and sparse approximations for Cox processes.} } @phdthesis{Mca16, author = {Rowan McAllister}, cat = {time rl}, title = {Bayesian Learning for Data-Efficient Control}, school = {University of Cambridge, Department of Engineering}, year = 2016, address = {Cambridge, UK}, url = {.}, abstract = {Applications to learn control of unfamiliar dynamical systems with increasing autonomy are ubiquitous. From robotics, to finance, to industrial processing, autonomous learning helps obviate a heavy reliance on experts for system identification and controller design. Often real world systems are nonlinear, stochastic, and expensive to operate (e.g. slow, energy intensive, prone to wear and tear). Ideally therefore, nonlinear systems can be identified with minimal system interaction. This thesis considers data efficient autonomous learning of control of nonlinear, stochastic systems. Data efficient learning critically requires probabilistic modelling of dynamics. Traditional control approaches use deterministic models, which easily overfit data, especially small datasets. We use probabilistic Bayesian modelling to learn systems from scratch, similar to the PILCO algorithm, which achieved unprecedented data efficiency in learning control of several benchmarks. We extend PILCO in three principle ways. First, we learn control under significant observation noise by simulating a filtered control process using a tractably analytic framework of Gaussian distributions. In addition, we develop the `latent variable belief Markov decision process' when filters must predict under real-time constraints. Second, we improve PILCO's data efficiency by directing exploration with predictive loss uncertainty and Bayesian optimisation, including a novel approximation to the Gittins index. Third, we take a step towards data efficient learning of high-dimensional control using Bayesian neural networks (BNN). Experimentally we show although filtering mitigates adverse effects of observation noise, much greater performance is achieved when optimising controllers with evaluations faithful to reality: by simulating closed-loop filtered control if executing closed-loop filtered control. Thus, controllers are optimised w.r.t. how they are used, outperforming filters applied to systems optimised by unfiltered simulations. We show directed exploration improves data efficiency. Lastly, we show BNN dynamics models are almost as data efficient as Gaussian process models. Results show data efficient learning of high-dimensional control is possible as BNNs scale to high-dimensional state inputs.} } @inproceedings{McaRas17, cat = {gp time rl}, booktitle = nips31, title = {Data-Efficient Reinforcement Learning in Continuous State-Action {G}aussian-{POMDP}s}, author = {Rowan McAllister and Carl Edward Rasmussen}, year = 2017, month = {December}, address = {Long Beach, California}, url = {.}, abstract = {We present a data-efficient reinforcement learning method for continuous state-action systems under significant observation noise. Data-efficient solutions under small noise exist, such as PILCO which learns the cartpole swing-up task in 30s. PILCO evaluates policies by planning state-trajectories using a dynamics model. However, PILCO applies policies to the observed state, therefore planning in observation space. We extend PILCO with filtering to instead plan in belief space, consistent with partially observable Markov decisions process (POMDP) planning. This enables data-efficient learning under significant observation noise, outperforming more naive methods such as post-hoc application of a filter to policies optimised by the original (unfiltered) PILCO algorithm. We test our method on the cartpole swing-up task, which involves nonlinear dynamics and requires nonlinear control.} } @inproceedings{McaGalKenEtal17, cat = {mvision}, booktitle = ijcai, title = {Concrete Problems for Autonomous Vehicle Safety: Advantages of {B}ayesian Deep Learning,}, author = {Rowan McAllister and Yarin Gal and Alex Kendall and Mark van der Wilk and Amar Shah and Roberto Cipolla and Adrian Weller}, year = 2017, month = {August}, address = {Melbourne, Australia}, url = {https://www.ijcai.org/proceedings/2017/0661.pdf}, abstract = {Autonomous vehicle (AV) software is typically composed of a pipeline of individual components, linking sensor inputs to motor outputs. Erroneous component outputs propagate downstream, hence safe AV software must consider the ultimate effect of each component's errors. Further, improving safety alone is not sufficient. Passengers must also feel safe to trust and use AV systems. To address such concerns, we investigate three under-explored themes for AV research: safety, interpretability, and compliance. Safety can be improved by quantifying the uncertainties of component outputs and propagating them forward through the pipeline. Interpretability is concerned with explaining what the AV observes and why it makes the decisions it does, building reassurance with the passenger. Compliance refers to maintaining some control for the passenger. We discuss open challenges for research within these themes. We highlight the need for concrete evaluation metrics, propose example problems, and highlight possible solutions.} } @phdthesis{Mch14, author = {Andrew McHutchon}, cat = {gp time rl}, title = {Nonlinear Modelling and Control using {G}aussian Processes}, school = {University of Cambridge, Department of Engineering}, year = 2014, address = {Cambridge, UK}, abstract = {In many scientific disciplines it is often required to make predictions about how a system will behave or to deduce the correct control values to elicit a particular desired response. Efficiently solving both of these tasks relies on the construction of a model capturing the system's operation. In the most interesting situations, the model needs to capture strongly nonlinear effects and deal with the presence of uncertainty and noise. Building models for such systems purely based on a theoretical understanding of underlying physical principles can be infeasibly complex and require a large number of simplifying assumptions. An alternative is to use a data-driven approach, which builds a model directly from observations. A powerful and principled approach to doing this is to use a Gaussian Process (GP).

In this thesis we start by discussing how GPs can be applied to data sets which have noise affecting their inputs. We present the "Noisy Input GP", which uses a simple local-linearisation to refer the input noise into heteroscedastic output noise, and compare it to other methods both theoretically and empirically. We show that this technique leads to a effective model for nonlinear functions with input and output noise. We then consider the broad topic of GP state space models for application to dynamical systems. We discuss a very wide variety of approaches for using GPs in state space models, including introducing a new method based on moment-matching, which consistently gave the best performance. We analyse the methods in some detail including providing a systematic comparison between approximate-analytic and particle methods. To our knowledge such a comparison has not been provided before in this area. Finally, we investigate an automatic control learning framework, which uses Gaussian Processes to model a system for which we wish to design a controller. Controller design for complex systems is a difficult task and thus a framework which allows an automatic design directly from data promises to be extremely useful. We demonstrate that the previously published framework cannot cope with the presence of observation noise but that the introduction of a state space model dramatically improves its performance. This contribution, along with some other suggested improvements opens the door for this framework to be used in real-world applications.}, url = {.} } @inproceedings{MchRas11, cat = {gp}, booktitle = nips24, title = {Gaussian Process Training with Input Noise}, author = {Andrew McHutchon and Carl Edward Rasmussen}, year = 2011, address = {Granada, Spain}, editor = {J. Shawe-Taylor and R.S. Zemel and P.L. Bartlett and F. Pereira and K.Q. Weinberger}, publisher = {Curran Associates, Inc.}, pages = {1341--1349}, url = {.}, abstract = {In standard Gaussian Process regression input locations are assumed to be noise free. We present a simple yet effective GP model for training on input points corrupted by i.i.d. Gaussian noise. To make computations tractable we use a local linear expansion about each input point. This allows the input noise to be recast as output noise proportional to the squared gradient of the GP posterior mean. The input noise hyperparameters are trained alongside other hyperparameters by the usual method of maximisation of the marginal likelihood, and allow estimation of the noise levels on each input dimension. Training uses an iterative scheme, which alternates between optimising the hyperparameters and calculating the posterior gradient. Analytic predictive moments can then be found for Gaussian distributed test points. We compare our model to others over a range of different regression problems and show that it improves over current methods.} } @inproceedings{MeeGhaNeaetal07, cat = {np}, month = {September}, author = {E. Meeds and Z. Ghahramani and R. Neal and S.T. Roweis}, series = {Bradford Books}, note = {Online contents gives pages 1002--1009, and 977--984 on pdf contents.}, booktitle = nips19, editor = {B. Sch\"olkopf and J. Platt and T. Hofmann}, title = {Modelling dyadic data with binary latent factors}, address = {Cambridge, MA, USA}, publisher = mit, year = 2007, pages = {977--984}, url = {.}, abstract = {We introduce binary matrix factorization, a novel model for unsupervised matrix decomposition. The decomposition is learned by fitting a non-parametric Bayesian probabilistic model with binary latent variables to a matrix of dyadic data. Unlike bi-clustering models, which assign each row or column to a single cluster based on a categorical hidden feature, our binary feature model reflects the prior belief that items and attributes can be associated with more than one latent cluster at a time. We provide simple learning and inference rules for this new model and show how to extend it to an infinite model in which the number of features is not a priori fixed but is allowed to grow with the size of the data.} } @incollection{MenFreMiretal, cat = {binf}, author = {Peter Menzel and Jes Frellsen and Mireya Plass and Simon H. Rasmussen and Anders Krogh}, booktitle = {Deep Sequencing Data Analysis}, title = {On the Accuracy of Short Read Mapping}, publisher = {Springer}, year = 2013, volume = 1038, pages = {39--59}, abstract = {The development of high-throughput sequencing technologies has revolutionized the way we study genomes and gene regulation. In a single experiment, millions of reads are produced. To gain knowledge from these experiments the first thing to be done is finding the genomic origin of the reads, i.e., mapping the reads to a reference genome. In this new situation, conventional alignment tools are obsolete, as they cannot handle this huge amount of data in a reasonable amount of time. Thus, new mapping algorithms have been developed, which are fast at the expense of a small decrease in accuracy. In this chapter we discuss the current problems in short read mapping and show that mapping reads correctly is a nontrivial task. Through simple experiments with both real and synthetic data, we demonstrate that different mappers can give different results depending on the type of data, and that a considerable fraction of uniquely mapped reads is potentially mapped to an incorrect location. Furthermore, we provide simple statistical results on the expected number of random matches in a genome (E-value) and the probability of a random match as a function of read length. Finally, we show that quality scores contain valuable information for mapping and why mapping quality should be evaluated in a probabilistic manner. In the end, we discuss the potential of improving the performance of current methods by considering these quality scores in a probabilistic mapping program.}, doi = {10.1007/978-1-62703-514-9_3}, annote = {Peter Menzel and Jes Frellsen contributed equally.} } @inproceedings{MesMahWelSon16, cat = {gm}, title = {Train and Test Tightness of {LP} Relaxations in Structured Prediction}, booktitle = icml33, year = 2016, month = {June}, address = {New York, NY}, url = {http://jmlr.org/proceedings/papers/v48/meshi16.html}, author = {Ofer Meshi and Mehrdad Mahdavi and Adrian Weller and David Sontag}, abstract = {Structured prediction is used in areas such as computer vision and natural language processing to predict structured outputs such as segmentations or parse trees. In these settings, prediction is performed by MAP inference or, equivalently, by solving an integer linear program. Because of the complex scoring functions required to obtain accurate predictions, both learning and inference typically require the use of approximate solvers. We propose a theoretical explanation to the striking observation that approximations based on linear programming (LP) relaxations are often tight on real-world instances. In particular, we show that learning with LP relaxed inference encourages integrality of training instances, and that tightness generalizes from train to test data.} } @phdthesis{Moh11, author = {Shakir Mohamed}, title = {Generalised {B}ayesian Matrix Factorisation Models}, school = {University of Cambridge, Department of Engineering}, year = 2011, address = {Cambridge, UK}, abstract = {Factor analysis and related models for probabilistic matrix factorisation are of central importance to the unsupervised analysis of data, with a colourful history more than a century long. Probabilistic models for matrix factorisation allow us to explore the underlying structure in data, and have relevance in a vast number of application areas including collaborative filtering, source separation, missing data imputation, gene expression analysis, information retrieval, computational finance and computer vision, amongst others.

This thesis develops generalisations of matrix factorisation models that advance our understanding and enhance the applicability of this important class of models. The generalisation of models for matrix factorisation focuses on three concerns: widening the applicability of latent variable models to the diverse types of data that are currently available; considering alternative structural forms in the underlying representations that are inferred; and including higher order data structures into the matrix factorisation framework. These three issues reflect the reality of modern data analysis and we develop new models that allow for a principled exploration and use of data in these settings. We place emphasis on Bayesian approaches to learning and the advantages that come with the Bayesian methodology. Our port of departure is a generalisation of latent variable models to members of the exponential family of distributions. This generalisation allows for the analysis of data that may be real-valued, binary, counts, non-negative or a heterogeneous set of these data types. The model unifies various existing models and constructs for unsupervised settings, the complementary framework to the generalised linear models in regression.

Moving to structural considerations, we develop Bayesian methods for learning sparse latent representations. We define ideas of weakly and strongly sparse vectors and investigate the classes of prior distributions that give rise to these forms of sparsity, namely the scale-mixture of Gaussians and the spike-and-slab distribution. Based on these sparsity favouring priors, we develop and compare methods for sparse matrix factorisation and present the first comparison of these sparse learning approaches. As a second structural consideration, we develop models with the ability to generate correlated binary vectors. Moment-matching is used to allow binary data with specified correlation to be generated, based on dichotomisation of the Gaussian distribution. We then develop a novel and simple method for binary PCA based on Gaussian dichotomisation. The third generalisation considers the extension of matrix factorisation models to multi-dimensional arrays of data that are increasingly prevalent. We develop the first Bayesian model for non-negative tensor factorisation and explore the relationship between this model and the previously described models for matrix factorisation.}, url = {.} } @inproceedings{MohHelGha08, title = {Bayesian Exponential Family {PCA}}, author = {Shakir Mohamed and Katherine A. Heller and Zoubin Ghahramani}, booktitle = nips21, editor = {D. Koller and D. Schuurmans and Y. Bengio and L. Bottou}, pages = {1089--1096}, year = 2009, month = {December}, address = {Cambridge, MA, USA}, publisher = mit, url = {.}, abstract = {Principal Components Analysis (PCA) has become established as one of the key tools for dimensionality reduction when dealing with real valued data. Approaches such as exponential family PCA and non-negative matrix factorisation have successfully extended PCA to non-Gaussian data types, but these techniques fail to take advantage of Bayesian inference and can suffer from problems of overfitting and poor generalisation. This paper presents a fully probabilistic approach to PCA, which is generalised to the exponential family, based on Hybrid Monte Carlo sampling. We describe the model which is based on a factorisation of the observed data matrix, and show performance of the model on both synthetic and real data.}, annote = {spotlight.} } @inproceedings{MohHelGha12, author = {Shakir Mohamed and Katherine A. Heller and Zoubin Ghahramani}, year = 2012, title = {{B}ayesian and {L}

Two new non-parametric Bayesian learning methods relying on Gaussian process priors over functions are developed. These priors are controlled by hyperparameters which set the characteristic length scale for each input dimension. In the simplest method, these parameters are fit from the data using optimization. In the second, fully Bayesian method, a Markov chain Monte Carlo technique is used to integrate over the hyperparameters. One advantage of these Gaussian process methods is that the priors and hyperparameters of the trained models are easy to interpret.

The Gaussian process methods are benchmarked against several other methods, on regression tasks using both real data and data generated from realistic simulations. The experiments show that small datasets are unsuitable for benchmarking purposes because the uncertainties in performance measurements are large. A second set of experiments provide strong evidence that the bagging procedure is advantageous for the Multivariate Adaptive Regression Splines (MARS) method.

The simulated datasets have controlled characteristics which make them useful for understanding the relationship between properties of the dataset and the performance of different methods. The dependency of the performance on available computation time is also investigated. It is shown that a Bayesian approach to learning in multi-layer perceptron neural networks achieves better performance than the commonly used early stopping procedure, even for reasonably short amounts of computation time. The Gaussian process methods are shown to consistently outperform the more conventional methods.} } @proceedings{RasBueGieSch04, title = {Pattern Recognition: 26th {DAGM} Symposium}, editor = {Carl Edward Rasmussen and Heinrich H.~B{\"u}lthoff and Martin A.~Giese and Bernhard Sch{\"o}lkopf}, year = 2004, publisher = {Springer}, pages = 581, month = {August}, series = lncs, volume = 3175, address = {Berlin, Germany}, location = {T{\"u}bingen, Germany}, url = {http://dx.doi.org/10.1007/b99676}, doi = {10.1007/b99676} } @article{RasCruGhaWil09, cat = {clust bioinf}, author = {Carl Edward Rasmussen and Bernhard J.~de la Cruz and Zoubin Ghahramani and David L.~Wild}, title = {Modeling and Visualizing Uncertainty in Gene Expression Clusters Using {D}irichlet Process Mixtures}, journal = tcbb, volume = 6, number = 4, issn = {1545-5963}, pages = {615--628}, year = 2009, url = {http://dx.doi.org/10.1109/TCBB.2007.70269}, abstract = {Although the use of clustering methods has rapidly become one of the standard computational approaches in the literature of microarray gene expression data, little attention has been paid to uncertainty in the results obtained. Dirichlet process mixture (DPM) models provide a nonparametric Bayesian alternative to the bootstrap approach to modeling uncertainty in gene expression clustering. Most previously published applications of Bayesian model-based clustering methods have been to short time series data. In this paper, we present a case study of the application of nonparametric Bayesian clustering methods to the clustering of high-dimensional nontime series gene expression data using full Gaussian covariances. We use the probability that two genes belong to the same cluster in a DPM model as a measure of the similarity of these gene expression profiles. Conversely, this probability can be used to define a dissimilarity measure, which, for the purposes of visualization, can be input to one of the standard linkage algorithms used for hierarchical clustering. Biologically plausible results are obtained from the Rosetta compendium of expression profiles which extend previously published cluster analyses of this data.}, doi = {10.1109/TCBB.2007.70269} } @inproceedings{RasDei08, cat = {rl}, title = {Probabilistic Inference for Fast Learning in Control}, pages = {229--242}, booktitle = {Recent Advances in Reinforcement Learning}, publisher = {Springer-Verlag}, year = 2008, editor = {S. Girgin and M. Loth and R. Munos and P. Preux and D. Ryabko}, author = {Carl Edward Rasmussen and Marc Peter Deisenroth}, volume = 5323, series = lncs, type = {Lecture Notes in Artificial Intelligence}, month = {November}, address = {Villeneuve d'Ascq, France}, url = {.}, abstract = {We provide a novel framework for very fast model-based reinforcement learning in continuous state and action spaces. The framework requires probabilistic models that explicitly characterize their levels of confidence. Within this framework, we use flexible, non-parametric models to describe the world based on previously collected experience. We demonstrate learning on the cart-pole problem in a setting where we provide very limited prior knowledge about the task. Learning progresses rapidly, and a good policy is found after only a hand-full of iterations.}, annote = {videos and more. slides.} } @inproceedings{RasGha01, cat = {np review}, author = {Carl Edward Rasmussen and Zoubin Ghahramani}, title = {{O}ccam's Razor}, booktitle = nips13, address = {Cambridge, MA, USA}, publisher = mit, editors = {T. Leen, T. G. Diettrich and V. Tresp}, pages = {294--300}, year = 2001, month = {December}, url = {.}, abstract = {The Bayesian paradigm apparently only sometimes gives rise to Occam's Razor; at other times very large models perform well. We give simple examples of both kinds of behaviour. The two views are reconciled when measuring complexity of functions, rather than of the machinery used to implement them. We analyze the complexity of functions for some linear in the parameter models that are equivalent to Gaussian Processes, and always find Occam's Razor at work.} } @inproceedings{RasGha02, cat = {gp np}, author = {Carl Edward Rasmussen and Zoubin Ghahramani}, title = {Infinite Mixtures of {G}aussian Process Experts}, booktitle = nips14, year = 2002, month = {December}, editors = {Dietterich, T. G., Becker, S. and Ghahramani, Z.}, pages = {881--888}, address = {Cambridge, MA, USA}, publisher = mit, url = {.}, abstract = {We present an extension to the Mixture of Experts (ME) model, where the individual experts are Gaussian Process (GP) regression models. Using an input-dependent adaptation of the Dirichlet Process, we implement a gating network for an infinite number of Experts. Inference in this model may be done efficiently using a Markov Chain relying on Gibbs sampling. The model allows the effective covariance function to vary with the inputs, and may handle large datasets --- thus potentially overcoming two of the biggest hurdles with GP models. Simulations show the viability of this approach.} } @inproceedings{RasGha03, cat = {gp mcmc}, author = {Carl Edward Rasmussen and Zoubin Ghahramani}, title = {Bayesian {Monte Carlo}}, booktitle = nips15, pages = {489--496}, year = 2003, month = {December}, address = {Cambridge, MA, USA}, editor = {S.~Becker and S.~Thrun and K.~Obermayer}, publisher = mit, url = {.}, abstract = {We investigate Bayesian alternatives to classical Monte Carlo methods for evaluating integrals. Bayesian Monte Carlo (BMC) allows the incorporation of prior knowledge, such as smoothness of the integrand, into the estimation. In a simple problem we show that this outperforms any classical importance sampling method. We also attempt more challenging multidimensional integrals involved in computing marginal likelihoods of statistical models (a.k.a. partition functions and model evidences). We find that Bayesian Monte Carlo outperformed Annealed Importance Sampling, although for very high dimensional problems or problems with massive multimodality BMC may be less adequate. One advantage of the Bayesian approach to Monte Carlo is that samples can be drawn from any distribution. This allows for the possibility of active design of sample points so as to maximise information gain.} } @inproceedings{RasKus04, cat = {rl}, author = {Carl Edward Rasmussen and Malte Ku\ss}, booktitle = nips16, editor = {S. Thrun and L.K. Saul and B. Sch\"olkopf}, title = {Gaussian processes in reinforcement learning}, address = {Cambridge, MA, USA}, publisher = mit, year = 2004, month = {December}, pages = {751--759}, url = {.}, abstract = {We exploit some useful properties of Gaussian process (GP) regression models for reinforcement learning in continuous state spaces and discrete time. We demonstrate how the GP model allows evaluation of the value function in closed form. The resulting policy iteration algorithm is demonstrated on a simple problem with a two dimensional state space. Further, we speculate that the intrinsic ability of GP models to characterise distributions of functions would allow the method to capture entire distributions over future values instead of merely their expectation, which has traditionally been the focus of much of reinforcement learning.} } @manual{RasNeaHinetal96, author = {Carl Edward Rasmussen and Radford M.~Neal and Geoffrey E.~Hinton and Drew van Camp and Mike Revow and Zoubin Ghahramani and Rafal Kustra and Robert Tibshirani}, title = {The {DELVE} manual}, pages = {1--108}, year = 1996, abstract = {DELVE -- Data for Evaluating Learning in Valid Experiments -- is a collection of datasets from many sources, an environment within which this data can be used to assess the performance of methods for learning relationships from data, and a repository for the results of such experiments.}, url = {.}, annote = {The delve website.}, organizaton = {University of Toronto} } @article{RasNic10, cat = {gp}, author = {Carl Edward Rasmussen and Hannes Nickisch}, title = {{Gaussian Processes for Machine Learning (GPML) Toolbox}}, journal = jmlr, volume = 11, pages = {3011--3015}, year = 2010, month = {December}, url = {http://www.jmlr.org/papers/volume11/rasmussen10a/rasmussen10a.pdf}, abstract = {The GPML toolbox provides a wide range of functionality for Gaussian process (GP) inference and prediction. GPs are specified by mean and covariance functions; we offer a library of simple mean and covariance functions and mechanisms to compose more complex ones. Several likelihood functions are supported including Gaussian and heavy-tailed for regression as well as others suitable for classification. Finally, a range of inference methods is provided, including exact and variational inference, Expectation Propagation, and Laplace's method dealing with non-Gaussian likelihoods and FITC for dealing with large regression tasks.}, annote = {Toolbox avaiable from here. Implements algorithms from Rasmussen and Williams, 2006.} } @inproceedings{RasQui05, cat = {gp}, author = {Carl Edward Rasmussen and Joaquin Qui{\~n}onero-Candela}, title = {Healing the {R}elevance {V}ector {M}achine through Augmentation}, year = 2005, pages = {689--696}, booktitle = icml22, editor = {L.~De Raedt and S.~Wrobel}, abstract = {The Relevance Vector Machine (RVM) is a sparse approximate Bayesian kernel method. It provides full predictive distributions for test cases. However, the predictive uncertainties have the unintuitive property, that \emph{they get smaller the further you move away from the training cases}. We give a thorough analysis. Inspired by the analogy to non-degenerate Gaussian Processes, we suggest augmentation to solve the problem. The purpose of the resulting model, RVM*, is primarily to corroborate the theoretical and experimental analysis. Although RVM* could be used in practical applications, it is no longer a truly sparse model. Experiments show that sparsity comes at the expense of worse predictive distributions.}, location = {Bonn, Germany}, url = {.} } @book{RasWil06, cat = {gp}, author = {Carl Edward Rasmussen and Christopher K.~I.~Williams}, title = {Gaussian Processes for Machine Learning}, publisher = mit, pages = 272, year = 2006, url = {.}, abstract = {Gaussian processes (GPs) provide a principled, practical, probabilistic approach to learning in kernel machines. GPs have received increased attention in the machine-learning community over the past decade, and this book provides a long-needed systematic and unified treatment of theoretical and practical aspects of GPs in machine learning. The treatment is comprehensive and self-contained, targeted at researchers and students in machine learning and applied statistics.}, annote = {Winner of the 2009 DeGroot Prize. Book web page, chapters and entire book pdf. GPML Toolbox.} } @article{RasWil93, author = {Carl Edward Rasmussen and David J. Willshaw}, title = {Presynaptic and postsynaptic comptetition in models for the development of neuromuscular connections}, journal = {Biological Cybernetics}, publisher = {Springer}, volume = 68, number = 5, pages = {409--419}, year = 1993, url = {http://dx.doi.org/10.1007/BF00198773}, abstract = {In the establishment of connections between nerve and muscle there is an initial stage when each muscle fibre is innervated by several different motor axons. Withdrawal of connections then takes place until each fibre has contact from just a single axon. The evidence suggests that the withdrawal process involves competition between nerve terminals. We examine in formal models several types of competitive mechanism that have been proposed for this phenomenon. We show that a model which combines competition for a presynaptic resource with competition for a postsynaptic resource is superior to others. This model accounts for many anatomical and physiological findings and has a biologically plausible implementation. Intrinsic withdrawal appears to be a side effect of the competitive mechanism rather than a separate non-competitive feature. The model's capabilities are confirmed by theoretical analysis and full scale computer simulations.}, doi = {10.1007/BF00198773} } @article{RavGhaWil02a, author = {Raval, A. and Ghahramani, Zoubin and Wild, David L.}, cat = {bioinf}, journal = {Bioinformatics}, number = 6, pages = {788-801}, title = {A {B}ayesian network model for protein fold and remote homologue recognition}, url = {.}, volume = 18, year = 2002, abstract = {Motivation: The Bayesian network approach is a framework which combines graphical representation and probability theory, which includes, as a special case, hidden Markov models. Hidden Markov models trained on amino acid sequence or secondary structure data alone have been shown to have potential for addressing the problem of protein fold and superfamily classification. Results: This paper describes a novel implementation of a Bayesian network which simultaneously learns amino acid sequence, secondary structure and residue accessibility for proteins of known three-dimensional structure. An awareness of the errors inherent in predicted secondary structure may be incorporated into the model by means of a confusion matrix. Training and validation data have been derived for a number of protein superfamilies from the Structural Classification of Proteins (SCOP) database. Cross validation results using posterior probability classification demonstrate that the Bayesian network performs better in classifying proteins of known structural superfamily than a hidden Markov model trained on amino acid sequences alone.} } @inproceedings{ReeGha13a, author = {Reed, Colorado and Ghahramani, Zoubin}, booktitle = {ICML}, cat = {approx, np}, pages = {1013-1021}, publisher = {JMLR.org}, series = {JMLR Proceedings}, title = {Scaling the {I}ndian Buffet Process via Submodular Maximization}, url = {.}, volume = 28, year = 2013, abstract = {Inference for latent feature models is inherently difficult as the inference space grows exponentially with the size of the input data and number of latent features. In this work, we use Kurihara & Welling (2008)'s maximization-expectation framework to perform approximate MAP inference for linear-Gaussian latent feature models with an Indian Buffet Process (IBP) prior. This formulation yields a submodular function of the features that corresponds to a lower bound on the model evidence. By adding a constant to this function, we obtain a nonnegative submodular function that can be maximized via a greedy algorithm that obtains at least a one-third approximation to the optimal solution. Our inference method scales linearly with the size of the input data, and we show the efficacy of our method on the largest datasets currently analyzed using an IBP model. } } @article{RojSchTurPet15, title = {A causal perspective on domain adaptation}, author = {Mateo Rojas-Carulla and Bernhard Sch{\"o}lkopf and Richard Turner and Jonas Peters}, journal = {arXiv preprint arXiv:1507.05333)}, url = {http://arxiv.org/abs/1507.05333}, year = 2015, abstract = {From training data from several related domains (or tasks), methods of domain adaptation try to combine knowledge to improve performance. This paper discusses an approach to domain adaptation which is inspired by a causal interpretation of the multi-task problem. We assume that a covariate shift assumption holds true for a subset of predictor variables: the conditional of the target variable given this subset of predictors is invariant with respect to shifts in those predictors (covariates). We propose to learn the corresponding conditional expectation in the training domains and use it for estimation in the target domain. We further introduce a method which allows for automatic inference of the above subset in regression and classification. We study the performance of this approach in an adversarial setting, in the case where no additional examples are available in the test domain. If a labeled sample is available, we provide a method for using both the transferred invariant conditional and task specific information. We present results on synthetic data sets and a sentiment analysis problem.} } @inproceedings{RotVanMooGha10, cat = {ssl}, author = {Rotsos, C. and {Van Gael}, J. and Moore, A.W. and Ghahramani, Z.}, booktitle = {1st International Workshop on Traffic Analysis and Classification (IWCMC '10)}, keywords = {Machine Learning,Network,Semi Supervised}, mendeley-tags ={Machine Learning,Network,Semi Supervised}, title = {Traffic Classification in Information Poor Environments}, url = {.}, year = 2010, month = {July}, address = {Caen, France}, abstract = {Traffic classification using machine learning continues to be an active research area. The majority of work in this area uses \emph{off-the-shelf} machine learning tools and treats them as \emph{black-box} classifiers. This approach turns all the modelling complexity into a feature selection problem. In this paper, we build a problem-specific solution to the traffic classification problem by designing a custom probabilistic graphical model. Graphical models are a modular framework to design classifiers which incorporate domain-specific knowledge. More specifically, our solution introduces semi-supervised learning which means we learn from both labelled and unlabelled traffic flows. We show that our solution performs competitively compared to previous approaches while using less data and simpler features.} } @inproceedings{RotVanMooetal10, cat = {gm}, author = {Charalampos Rotsos and Jurgen {Van Gael} and Andrew W. Moore and Zoubin Ghahramani}, year = 2010, title = {Probabilistic Graphical Models for Semi-Supervised Traffic Classification}, booktitle = {The 6th International Wireless Communications and Mobile Computing Conference}, pages = {752--757}, address = {Caen, France}, url = {.}, abstract = {Traffic classification using machine learning continues to be an active research area. The majority of work in this area uses off-the-shelf machine learning tools and treats them as black-box classifiers. This approach turns all the modelling complexity into a feature selection problem. In this paper, we build a problem-specific solution to the traffic classification problem by designing a custom probabilistic graphical model. Graphical models are a modular framework to design classifiers which incorporate domain-specific knowledge. More specifically, our solution introduces semi-supervised learning which means we learn from both labelled and unlabelled traffic flows. We show that our solution performs competitively compared to previous approaches while using less data and simpler features.} } @article{RowGha99a, author = {Roweis, Sam T. and Ghahramani, Zoubin}, cat = {review}, journal = {Neural Computation}, number = 2, pages = {305-345}, title = {A Unifying Review of Linear {G}aussian Models}, url = {.}, volume = 11, year = 1999, abstract = {Factor analysis, principal component analysis, mixtures of gaussian clusters, vector quantization, Kalman filter models, and hidden Markov models can all be unified as variations of unsupervised learning under a single basic generative model. This is achieved by collecting together disparate observations and derivations made by many previous authors and introducing a new way of linking discrete and continuous state models using a simple nonlinearity. Through the use of other nonlinearities, we show how independent component analysis is also a variation of the same basic generative model. We show that factor analysis and mixtures of gaussians can be implemented in autoencoder neural networks and learned using squared error plus the same regularization term. We introduce a new model for static data, known as sensible principal component analysis, as well as a novel concept of spatially adaptive observation noise. We also review some of the literature involving global and local mixtures of the basic models and provide pseudocode for inference and learning for all the basic models. } } @inproceedings{RowPacWel17, cat = {gm}, title = {Conditions beyond treewidth for tightness of higher-order {LP} relaxations}, booktitle = aistats20, year = 2017, month = {April}, address = {Fort Lauderdale, Florida}, author = {Mark Rowland and Aldo Pacchiano and Adrian Weller}, url = {http://mlg.eng.cam.ac.uk/adrian/conditions.pdf}, abstract = {Linear programming (LP) relaxations are a popular method to attempt to find a most likely configuration of a discrete graphical model. If a solution to the relaxed problem is obtained at an integral vertex then the solution is guaranteed to be exact and we say that the relaxation is tight. We consider binary pairwise models and introduce new methods which allow us to demonstrate refined conditions for tightness of LP relaxations in the Sherali-Adams hierarchy. Our results include showing that for higher order LP relaxations, treewidth is not precisely the right way to characterize tightness. This work is primarily theoretical, with insights that can improve efficiency in practice.} } @inproceedings{Roy11, cat = {np gm}, author = {Daniel M. Roy}, title = {On the computability and complexity of {B}ayesian reasoning}, booktitle = {NIPS Workshop on Philosophy and Machine Learning}, year = 2011, url = {http://danroy.org/papers/Roy-NIPSPML-2011.pdf}, abstract = { If we consider the claim made by some cognitive scientists that the mind performs Bayesian reasoning, and if we simultaneously accept the Physical Church-Turing thesis and thus believe that the computational power of the mind is no more than that of a Turing machine, then what limitations are there to the reasoning abilities of the mind? I give an overview of joint work with Nathanael Ackerman (Harvard, Mathematics) and Cameron Freer (MIT, CSAIL) that bears on the computability and complexity of Bayesian reasoning. In particular, we prove that conditional probability is in general not computable in the presence of continuous random variables. However, in light of additional structure in the prior distribution, such as the presence of certain types of noise, or of exchangeability, conditioning is possible. These results cover most of statistical practice. At the workshop on Logic and Computational Complexity, we presented results on the computational complexity of conditioning, embedding sharp-P-complete problems in the task of computing conditional probabilities for diffuse continuous random variables. This work complements older work. For example, under cryptographic assumptions, the computational complexity of producing samples and computing probabilities was separated by Ben-David, Chor, Goldreich and Luby. In recent work, we also make use of cryptographic assumptions to show that different representations of exchangeable sequences may have vastly different complexity. However, when faced with an adversary that is computational bounded, these different representations have the same complexity, highlighting the fact that knowledge representation and approximation play a fundamental role in the possibility and plausibility of Bayesian reasoning.} } @phdthesis{Saa11, cat = {gp}, author = {Yunus Saat\c{c}i}, title = {Scalable Inference for Structured {G}aussian Process Models}, school = {University of Cambridge, Department of Engineering}, year = 2011, url = {.}, address = {Cambridge, UK}, abstract = {The generic inference and learning algorithm for Gaussian Process (GP) regression has O(N

The majority of algorithms designed to improve the scaling of GPs are founded on the idea of approximating the true covariance matrix, which is usually of rank N, with a matrix of rank P, where P<<N. Typically, the true training set is replaced with a smaller, representative (pseudo-) training set such that a specific measure of information loss is minimized. These algorithms typically attain O(P

We define a

This thesis studies four more types of structured GP. First, we comprehensively review the case of kernels corresponding to

We illustrate the power of these methods on several real-world regression datasets which satisfy the assumptions inherent in the structured GP employed. In many cases we obtain performance comparable to the generic GP algorithm. We also analyse the performance degradation when these assumptions are not met, and in several cases show that it is comparable to that observed for sparse GP methods. We provide similar results for regression tasks with non-Gaussian likelihoods, an extension rarely addressed by sparse GP techniques.} } @inproceedings{SaaTurRas10, cat = {gp time}, author = {Yunus {Saat\c{c}i} and Ryan Turner and Carl Edward Rasmussen}, title = {Gaussian Process Change Point Models}, booktitle = icml27, pages = {927--934}, year = 2010, month = {June}, address = {Haifa, Israel}, url = {.}, abstract = {We combine Bayesian online change point detection with Gaussian processes to create a nonparametric time series model which can handle change points. The model can be used to locate change points in an online manner; and, unlike other Bayesian online change point detection algorithms, is applicable when temporal correlations in a regime are expected. We show three variations on how to apply Gaussian processes in the change point context, each with their own advantages. We present methods to reduce the computational burden of these models and demonstrate it on several real world data sets.}, annote = {poster, slides.} } @inproceedings{SalRowGha03a, author = {Salakhutdinov, Ruslan and Roweis, Sam T. and Ghahramani, Zoubin}, booktitle = {UAI}, cat = {approx}, editor = {Meek, Christopher and Kj{\ae}rulff, Uffe}, isbn = {0-127-05664-5}, pages = {509-516}, publisher = {Morgan Kaufmann}, title = {On the Convergence of Bound Optimization Algorithms}, url = {.}, year = 2003, abstract = {Many practitioners who use EM and related algorithms complain that they are sometimes slow. When does this happen, and what can be done about it? In this paper, we study the general class of bound optimization algorithms - including EM, Iterative Scaling, Non-negative Matrix Factorization, CCCP - and their relationship to direct optimization algorithms such as gradientbased methods for parameter learning. We derive a general relationship between the updates performed by bound optimization methods and those of gradient and second-order methods and identify analytic conditions under which bound optimization algorithms exhibit quasi-Newton behavior, and under which they possess poor, first-order convergence. Based on this analysis, we consider several specific algorithms, interpret and analyze their convergence properties and provide some recipes for preprocessing input to these algorithms to yield faster convergence behavior. We report empirical results supporting our analysis and showing that simple data preprocessing can result in dramatically improved performance of bound optimizers in practice.} } @inproceedings{SalRowGha03b, author = {Salakhutdinov, Ruslan and Roweis, Sam T. and Ghahramani, Zoubin}, booktitle = {ICML}, cat = {approx}, editor = {Fawcett, Tom and Mishra, Nina}, isbn = {1-57735-189-4}, pages = {672-679}, publisher = {AAAI Press}, title = {Optimization with {EM} and Expectation-Conjugate-Gradient}, url = {.}, year = 2003, abstract = {We show a close relationship between the Expectation- Maximization (EM) algorithm and direct optimization algorithms such as gradientbased methods for parameter learning. We identify analytic conditions under which EM exhibits Newton-like behavior, and conditions under which it possesses poor, first-order convergence. Based on this analysis, we propose two novel algorithms for maximum likelihood estimation of latent variable models, and report empirical results showing that, as predicted by theory, the proposed new algorithms can substantially outperform standard EM in terms of speed of convergence in certain cases. } } @article{SavGhaGrietal10, cat = {bioinf}, author = {R. S. Savage and Z. Ghahramani and J. E. Griffin and B. de la Cruz and D. L. Wild}, year = 2010, title = {Discovering Transcriptional Modules by {Bayesian} Data Integration}, journal = {Bioinformatics}, volume = 26, pages = {i158--i167}, url = {.}, abstract = {Motivation: We present a method for directly inferring transcriptional modules (TMs) by integrating gene expression and transcription factor binding (ChIP-chip) data. Our model extends a hierarchical Dirichlet process mixture model to allow data fusion on a gene-by-gene basis. This encodes the intuition that co-expression and co-regulation are not necessarily equivalent and hence we do not expect all genes to group similarly in both datasets. In particular, it allows us to identify the subset of genes that share the same structure of transcriptional modules in both datasets.

Results: We find that by working on a gene-by-gene basis, our model is able to extract clusters with greater functional coherence than existing methods. By combining gene expression and transcription factor binding (ChIP-chip) data in this way, we are better able to determine the groups of genes that are most likely to represent underlying TMs.

Availability: If interested in the code for the work presented in this article, please contact the authors.} } @article{SavHelXuetal09, cat = {clust bioinf}, volume = 10, number = 242, pages = {1--9}, month = {August}, author = {R.~Savage and K.~A.~Heller and Y.~Xu and Zoubin Ghahramani and W.~Truman and M.~Grant and K.~Denby and D.~L.~Wild}, title = {{R/BHC}: fast {B}ayesian hierarchical clustering for microarray data}, publisher = {BioMed Central}, journal = {BMC Bioinformatics 2009}, year = 2009, url = {.}, abstract = {Background: Although the use of clustering methods has rapidly become one of the standard computational approaches in the literature of microarray gene expression data analysis, little attention has been paid to uncertainty in the results obtained.

Results: We present an R/Bioconductor port of a fast novel algorithm for Bayesian agglomerative hierarchical clustering and demonstrate its use in clustering gene expression microarray data. The method performs bottom-up hierarchical clustering, using a Dirichlet Process (infinite mixture) to model uncertainty in the data and Bayesian model selection to decide at each step which clusters to merge.

Conclusion: Biologically plausible results are presented from a well studied data set: expression profiles of \emph{A. thaliana} subjected to a variety of biotic and abiotic stresses. Our method avoids several limitations of traditional methods, for example how many clusters there should be and how to choose a principled distance metric.}, doi = {10.1186/1471-2105-10-242}, issn = {1471-2105}, pubmedid = 19660130 } @inproceedings{Sch09, title = {Function factorization using warped {G}aussian processes}, author = {Mikkel N. Schmidt}, booktitle = icml26, address = {Montr\'{e}al, QC, Canada}, pages = {921--928}, editor = {L\'{e}on Bottou and Michael Littman}, month = {June}, publisher = {Omnipress}, year = 2009, url = {.}, abstract = {We introduce a new approach to non-linear regression called function factorization, that is suitable for problems where an output variable can reasonably be modeled by a number of multiplicative interaction terms between non-linear functions of the inputs. The idea is to approximate a complicated function on a high-dimensional space by the sum of products of simpler functions on lower-dimensional subspaces. Function factorization can be seen as a generalization of matrix and tensor factorization methods, in which the data are approximated by the sum of outer products of vectors. We present a non-parametric Bayesian approach to function factorization where the priors over the factorizing functions are warped Gaussian processes, and we do inference using Hamiltonian Markov chain Monte Carlo. We demonstrate the superior predictive performance of the method on a food science data set compared to Gaussian process regression and tensor factorization using PARAFAC and GEMANOVA models.}, annote = {slides. poster. video.} } @inproceedings{Sch09b, title = {Linearly constrained {B}ayesian matrix factorization for blind source separation}, author = {Mikkel N. Schmidt}, booktitle = nips22, address = {Cambridge, MA, USA}, publisher = mit, editor = {Y. Bengio and D. Schuurmans and J. Lafferty and C. K. I. Williams and A. Culotta}, month = {December}, pages = {1624--1632}, year = 2009, url = {.}, abstract = {We present a general Bayesian approach to probabilistic matrix factorization subject to linear constraints. The approach is based on a Gaussian observation model and Gaussian priors with bilinear equality and inequality constraints. We present an efficient Markov chain Monte Carlo inference procedure based on Gibbs sampling. Special cases of the proposed model are Bayesian formulations of non-negative matrix factorization and factor analysis. The method is evaluated on a blind source separation problem. We demonstrate that our algorithm can be used to extract meaningful and interpretable features that are remarkably different from features extracted using existing related matrix factorization techniques.}, annote = {code.} } @article{ SchDuvHen2014, cat = {gp}, title = "Probabilistic {ODE} Solvers with {R}unge-{K}utta Means", author = "Michael Schober and David Duvenaud and Philipp Hennig", url = {http://arxiv.org/abs/1406.2582}, journal = "arXiv preprint arXiv:1406.2582", month = {June}, year = 2014, abstract = {Runge-Kutta methods are the classic family of solvers for ordinary differential equations (ODEs), and the basis for the state-of-the-art. Like most numerical methods, they return point estimates. We construct a family of probabilistic numerical methods that instead return a Gauss-Markov process defining a probability distribution over the ODE solution. In contrast to prior work, we construct this family such that posterior means match the outputs of the Runge-Kutta family exactly, thus inheriting their proven good properties. Remaining degrees of freedom not identified by the match to Runge-Kutta are chosen such that the posterior probability measure fits the observed structure of the ODE. Our results shed light on the structure of Runge-Kutta solvers from a new direction, provide a richer, probabilistic output, have low computational cost, and raise new research questions.} } @inproceedings{SchMoh09, title = {Probabilistic non-negative tensor factorization using {M}arkov chain {M}onte {C}arlo}, author = {Mikkel N. Schmidt and Shakir Mohamed}, booktitle = {European Signal Processing Conference (EUSIPCO)}, address = {Glasgow, Scotland}, month = {August}, year = 2009, pages = {1918--1922}, url = {.}, abstract = {We present a probabilistic model for learning non-negative tensor factorizations (NTF), in which the tensor factors are latent variables associated with each data dimension. The non-negativity constraint for the latent factors is handled by choosing priors with support on the non-negative numbers. Two Bayesian inference procedures based on Markov chain Monte Carlo sampling are described: Gibbs sampling and Hamiltonian Markov chain Monte Carlo. We evaluate the model on two food science data sets, and show that the probabilistic NTF model leads to better predictions and avoids overfitting compared to existing NTF approaches.}, annote = {Rated by reviewers amongst the top 5% of the presented papers.} } @article{SchVenKnoetal11, cat = {bioinf}, author = {Cornelia Schone and Anne Venner and David A. Knowles and Mahesh M Karnani and Denis Burdakov}, title = {Dichotomous cellular properties of mouse orexin/hypocretin neurons}, url = {http://jp.physoc.org/content/early/2011/04/11/jphysiol.2011.208637.abstract}, journal = {The Journal of Physiology}, year = 2011, abstract = {Hypothalamic hypocretin/orexin (hcrt/orx) neurons recently emerged as critical regulators of sleep-wake cycles, reward-seeking, and body energy balance. However, at the level of cellular and network properties, it remains unclear whether hcrt/orx neurons are one homogenous population, or whether there are several distinct types of hcrt/orx cells. Here, we collated diverse structural and functional information about individual hcrt/orx neurons in mouse brain slices, by combining patch-clamp analysis of spike firing, membrane currents, and synaptic inputs with confocal imaging of cell shape and subsequent 3-dimensional Sholl analysis of dendritic architecture. Statistical cluster analysis of intrinsic firing properties revealed that hcrt/orx neurons fall into two distinct types. These two cell types also differ in the complexity of their dendritic arbour, the strength of AMPA and GABAA receptor-mediated synaptic drive that they receive, and the density of low-threshold, 4-aminopyridine-sensitive, transient K+ current. Our results provide quantitative evidence that, at the cellular level, the mouse hcrt/orx system is composed of two classes of neurons with different firing properties, morphologies, and synaptic input organization.} } @inproceedings{SchWinHan09, title = {Bayesian non-negative matrix factorization}, author = {Mikkel N. Schmidt and Ole Winther and Lars Kai Hansen}, booktitle = {8th International Conference on Independent Component Analysis and Signal Separation}, pages = {540--547}, publisher = {Springer}, address = {Paraty, Brazil}, series = lncs, volume = 5441, year = 2009, month = {March}, url = {.}, abstract = {We present a Bayesian treatment of non-negative matrix factorization (NMF), based on a normal likelihood and exponential priors, and derive an efficient Gibbs sampler to approximate the posterior density of the NMF factors. On a chemical brain imaging data set, we show that this improves interpretability by providing uncertainty estimates. We discuss how the Gibbs sampler can be used for model order selection by estimating the marginal likelihood, and compare with the Bayesian information criterion. For computing the maximum a posteriori estimate we present an iterated conditional modes algorithm that rivals existing state-of-the-art NMF algorithms on an image feature extraction problem.}, annote = {slides. code.} } @inproceedings{SciGhaGor15, author = {Adam \'Scibior and Zoubin Ghahramani and Andrew D. Gordon}, title = {Practical Probabilistic Programming with Monads}, booktitle = {Proceedings of the 8th ACM SIGPLAN Symposium on Haskell}, year = 2015, publisher = acm, url = {.}, abstract = {The machine learning community has recently shown a lot of interest in practical probabilistic programming systems that target the problem of Bayesian inference. Such systems come in different forms, but they all express probabilistic models as computational processes using syntax resembling programming languages. In the functional programming community monads are known to offer a convenient and elegant abstraction for programming with probability distributions, but their use is often limited to very simple inference problems. We show that it is possible to use the monad abstraction to construct probabilistic models for machine learning, while still offering good performance of inference in challenging models. We use a GADT as an underlying representation of a probability distribution and apply Sequential Monte Carlo-based methods to achieve efficient inference. We define a formal semantics via measure theory. We demonstrate a clean and elegant implementation that achieves performance comparable with Anglican, a state-of-the-art probabilistic programming system.}, doi = {10.1145/2804302.2804317} } @article{ShaGha13a, author = {Shah, Amar and Ghahramani, Zoubin}, cat = {clust, np}, journal = {UAI}, title = {Determinantal Clustering Processes - {A} Nonparametric {B}ayesian Approach to Kernel Based Semi-Supervised Clustering}, url = {.}, year = 2013, abstract = {Semi-supervised clustering is the task of clustering data points into clusters where only a fraction of the points are labelled. The true number of clusters in the data is often unknown and most models require this parameter as an input. Dirichlet process mixture models are appealing as they can infer the number of clusters from the data. However, these models do not deal with high dimensional data well and can encounter difficulties in inference. We present a novel nonparameteric Bayesian kernel based method to cluster data points without the need to prespecify the number of clusters or to model complicated densities from which data points are assumed to be generated from. The key insight is to use determinants of submatrices of a kernel matrix as a measure of how close together a set of points are. We explore some theoretical properties of the model and derive a natural Gibbs based algorithm with MCMC hyperparameter learning. The model is implemented on a variety of synthetic and real world data sets. } } @inproceedings{ShaQuaLam12, cat = {mvision}, title = {Augmented Attributes Representations}, author = {Viktoriia Sharmanska and Novi Quadrianto and Christoph Lampert}, booktitle = {12th European Conference on Computer Vision}, pages = {242--255}, year = 2012, url = {.}, abstract = {We propose a new learning method to infer a mid-level feature representation that combines the advantage of semantic attribute representations with the higher expressive power of non-semantic features. The idea lies in augmenting an existing attribute-based representation with additional dimensions for which an autoencoder model is coupled with a large-margin principle. This construction allows a smooth transition between the zero-shot regime with no training example, the unsupervised regime with training examples but without class labels, and the supervised regime with training examples and with class labels. The resulting optimization problem can be solved efficiently, because several of the necessity steps have closed-form solutions. Through extensive experiments we show that the augmented representation achieves better results in terms of object categorization accuracy than the semantic representation alone.} } @inproceedings{ShaQuaLam13, cat = {mvision}, title = {Learning to Rank Using Privileged Information}, author = {Viktoriia Sharmanska and Novi Quadrianto and Christoph Lampert}, booktitle = {International Conference on Computer Vision}, year = 2013, url = {.}, abstract = {Many computer vision problems have an asymmetric distribution of information between training and test time. In this work, we study the case where we are given additional information about the training data, which however will not be available at test time. This situation is called learning using privileged information (LUPI). We introduce two maximum-margin techniques that are able to make use of this additional source of information, and we show that the framework is applicable to several scenarios that have been studied in computer vision before. Experiments with attributes, bounding boxes, image tags and rationales as additional information in object classification show promising results.} } @inproceedings{ShaWilGha14a, author = {Shah, Amar and Wilson, Andrew Gordon and Ghahramani, Zoubin}, booktitle = {AISTATS}, cat = {gp}, publisher = {JMLR.org}, series = {JMLR Proceedings}, title = {Student-t Processes as Alternatives to {G}aussian Processes}, url = {.}, year = 2014, abstract = {We investigate the Student-t process as an alternative to the Gaussian process as a nonparametric prior over functions. We derive closed form expressions for the marginal likelihood and predictive distribution of a Student-t process, by integrating away an inverse Wishart process prior over the covariance kernel of a Gaussian process model. We show surprising equivalences between different hierarchical Gaussian process models leading to Student-t processes, and derive a new sampling scheme for the inverse Wishart process, which helps elucidate these equivalences. Overall, we show that a Student-t process can retain the attractive properties of a Gaussian process -- a nonparametric representation, analytic marginal and predictive distributions, and easy model selection through covariance kernels -- but has enhanced flexibility, and predictive covariances that, unlike a Gaussian process, explicitly depend on the values of training observations. We verify empirically that a Student-t process is especially useful in situations where there are changes in covariance structure, or in applications like Bayesian optimization, where accurate predictive covariances are critical for good performance. These advantages come at no additional computational cost over Gaussian processes. } } @inproceedings{SilChuGha08, cat = {gm}, booktitle = nips20, month = {December}, title = {Hidden common cause relations in relational learning}, author = {R.~Silva and W.~Chu and Zoubin Ghahramani}, year = 2008, address = {Cambridge, MA, USA}, publisher = mit, editor = {J.~C.~Platt and D.~Koller and Y.~Singer and S.~Roweis}, pages = {1345--1352}, url = {.}, abstract = {When predicting class labels for objects within a relational database, it is often helpful to consider a model for relationships: this allows for information between class labels to be shared and to improve prediction performance. However, there are different ways by which objects can be related within a relational database. One traditional way corresponds to a Markov network structure: each existing relation is represented by an undirected edge. This encodes that, conditioned on input features, each object label is independent of other object labels given its neighbors in the graph. However, there is no reason why Markov networks should be the only representation of choice for symmetric dependence structures. Here we discuss the case when relationships are postulated to exist due to \emph{hidden common causes}. We discuss how the resulting graphical model differs from Markov networks, and how it describes different types of real-world relational processes. A Bayesian nonparametric classification model is built upon this graphical representation and evaluated with several empirical studies.}, annote = {Code at http://www.homepages.ucl.ac.uk/~ucgtrbd/code/xgp} } @inproceedings{SilGha06a, author = {Silva, Ricardo and Ghahramani, Zoubin}, booktitle = {UAI}, cat = {gm}, isbn = {0-9749039-2-2}, publisher = {AUAI Press}, title = {Bayesian Inference for {G}aussian Mixed Graph Models}, url = {.}, year = 2006, abstract = {We introduce priors and algorithms to perform Bayesian inference in Gaussian models defined by acyclic directed mixed graphs. Such a class of graphs, composed of directed and bi-directed edges, is a representation of conditional independencies that is closed under marginalization and arises naturally from causal models which allow for unmeasured confounding. Monte Carlo methods and a variational approximation for such models are presented. Our algorithms for Bayesian inference allow the evaluation of posterior distributions for several quantities of interest, including causal effects that are not identifiable from data alone but could otherwise be inferred where informative prior knowledge about confounding is available.} } @article{SilGha09, cat = {gm}, volume = 10, month = {June}, author = {R. Silva and Z. Ghahramani}, title = {The hidden life of latent variables: {Bayesian} learning with mixed graph models}, publisher = {Association for Computing Machinery}, journal = jmlr, pages = {1187--1238}, year = 2009, url = {.}, abstract = {Directed acyclic graphs (DAGs) have been widely used as a representation of conditional independence in machine learning and statistics. Moreover, hidden or latent variables are often an important component of graphical models. However, DAG models suffer from an important limitation: the family of DAGs is not closed under marginalization of hidden variables. This means that in general we cannot use a DAG to represent the independencies over a subset of variables in a larger DAG. Directed mixed graphs (DMGs) are a representation that includes DAGs as a special case, and overcomes this limitation. This paper introduces algorithms for performing Bayesian inference in Gaussian and probit DMG models. An important requirement for inference is the specification of the distribution over parameters of the models. We introduce a new distribution for covariance matrices of Gaussian DMGs. We discuss and illustrate how several Bayesian machine learning tasks can benefit from the principle presented here: the power to model dependencies that are generated from hidden variables, but without necessarily modeling such variables explicitly.} } @inproceedings{SilGha09b, cat = {gm}, volume = 5, author = {R. Silva and Z. Ghahramani}, annote = {Code at http://www.homepages.ucl.ac.uk/~ucgtrbd/code/fmog-version0.zip}, note = {ISSN: 1938-7228}, booktitle = aistats12, title = {Factorial mixture of {Gaussians} and the marginal independence model}, publisher = jmlr, pages = {520--527}, year = 2009, month = {April}, address = {Clearwater Beach, FL, USA}, url = {.}, abstract = {Marginal independence constraints play an important role in learning with graphical models. One way of parameterizing a model of marginal independencies is by building a latent variable model where two independent observed variables have no common latent source. In sparse domains, however, it might be advantageous to model the marginal observed distribution directly, without explicitly including latent variables in the model. There have been recent advances in Gaussian and binary models of marginal independence, but no models with non-linear dependencies between continuous variables has been proposed so far. In this paper, we describe how to generalize the Gaussian model of marginal independencies based on mixtures, and how to learn parameters. This requires a non-standard parameterization and raises difficult non-linear optimization issues.} } @inproceedings{SilHelGha07a, author = {Silva, Ricardo and Heller, Katherine A. and Ghahramani, Zoubin}, booktitle = {AISTATS}, cat = {ir}, editor = {Meila, Marina and Shen, Xiaotong}, pages = {500-507}, publisher = {JMLR.org}, series = {JMLR Proceedings}, title = {Analogical Reasoning with Relational {B}ayesian Sets}, url = {.}, volume = 2, year = 2007, abstract = {Analogical reasoning depends fundamentally on the ability to learn and generalize about relations between objects. There are many ways in which objects can be related, making automated analogical reasoning very chal- lenging. Here we develop an approach which, given a set of pairs of related objects S = {A1:B1,A2:B2,...,AN:BN}, measures how well other pairs A:B fit in with the set S. This addresses the question: is the relation between objects A and B analogous to those relations found in S? We recast this classi- cal problem as a problem of Bayesian analy- sis of relational data. This problem is non- trivial because direct similarity between ob- jects is not a good way of measuring analo- gies. For instance, the analogy between an electron around the nucleus of an atom and a planet around the Sun is hardly justified by isolated, non-relational, comparisons of an electron to a planet, and a nucleus to the Sun. We develop a generative model for predicting the existence of relationships and extend the framework of Ghahramani and Heller (2005) to provide a Bayesian measure for how analogous a relation is to other relations. This sheds new light on an old problem, which we motivate and illustrate through practical applications in exploratory data analysis.} } @article{SilHelGhaetal10, cat = {ir bioinf}, author = {R. Silva and K. A. Heller and Z. Ghahramani and E. M. Airoldi}, year = 2010, pages = {615--644}, volume = 4, number = 2, title = {Ranking Relations Using Analogies in Biological and Information Networks}, journal = {Annals of Applied Statistics}, url = {.}, abstract = {Analogical reasoning depends fundamentally on the ability to learn and generalize about relations between objects. We develop an approach to relational learning which, given a set of pairs of objects S = {A(1):B(1), A(2):B(2), ..., A(N):B(N)}, measures how well other pairs A:B fit in with the set S. Our work addresses the question: is the relation between objects A and B analogous to those relations found in S? Such questions are particularly relevant in information retrieval, where an investigator might want to search for analogous pairs of objects that match the query set of interest. There are many ways in which objects can be related, making the task of measuring analogies very challenging. Our approach combines a similarity measure on function spaces with Bayesian analysis to produce a ranking. It requires data containing features of the objects of interest and a link matrix specifying which relationships exist; no further attributes of such relationships are necessary. We illustrate the potential of our method on text analysis and information networks. An application on discovering functional interactions between pairs of proteins is discussed in detail, where we show that our approach can work in practice even if a small set of protein pairs is provided.} } @inproceedings{SimSciToletal16, cat = {gp}, author = {Carl-Johann Simon-Gabriel and Adam \'Scibior and Ilya Tolstikhin and Bernhard Sch\"olkopf}, title = {Consistent Kernel Mean Estimation for Functions of Random Variables}, booktitle = nips30, year = 2016, url = {.}, abstract = {We provide a theoretical foundation for non-parametric estimation of functions of random variables using kernel mean embeddings. We show that for any continuous function f, consistent estimators of the mean embedding of a random variable X lead to consistent estimators of the mean embedding of f(X). For Mat\'ern kernels and sufficiently smooth functions we also provide rates of convergence. Our results extend to functions of multiple random variables. If the variables are dependent, we require an estimator of the mean embedding of their joint distribution as a starting point; if they are independent, it is sufficient to have separate estimators of the mean embeddings of their marginal distributions. In either case, our results cover both mean embeddings based on i.i.d. samples as well as "reduced set" expansions in terms of dependent expansion points. The latter serves as a justification for using such expansions to limit memory resources when applying the approach as a basis for probabilistic programming.} } @inproceedings{SinQuiBaketal04, cat = {gp}, author = {Fabian Sinz and Joaquin Qui{\~n}onero-Candela and G{\"o}khan H.~Bakir and Carl Edward Rasmussen and Matthias O.~Franz}, title = {Learning Depth From Stereo}, year = 2004, series = lncs, volume = 3175, publisher = {Springer}, pages = {245--252}, month = 09, journal = {Pattern Recognition: Proceedings of the 26th DAGM Symposium}, editor = {C.~E.~Rasmussen and H.~H.~B{\"u}lthoff and B.~Sch{\"o}lkopf and M.~A.~Giese}, address = {Berlin, Germany}, abstract = {We compare two approaches to the problem of estimating the depth of a point in space from observing its image position in two different cameras: 1.~The classical photogrammetric approach explicitly models the two cameras and estimates their intrinsic and extrinsic parameters using a tedious calibration procedure; 2.~A generic machine learning approach where the mapping from image to spatial coordinates is directly approximated by a Gaussian Process regression. Our results show that the generic learning approach, in addition to simplifying the procedure of calibration, can lead to higher depth accuracies than classical calibration although no specific domain knowledge is used.}, booktitle = {26th DAGM Symposium}, location = {T{\"u}bingen, Germany}, url = {.} } @inproceedings{SkoSjaKok17, cat = {sigproc}, author = {Martin A. Skoglund and Zoran Sjanic and Manon Kok}, title = {On orientation estimation using iterative methods in {E}uclidean space}, booktitle = {International Conference on Information Fusion}, year = 2017, month = {July}, address = {Xi'an, China}, url = {dx.doi.org/10.23919/ICIF.2017.8009830}, abstract = {This paper presents three iterative methods for orientation estimation. The first two are based on iterated Extended Kalman filter (IEKF) formulations with different state representations. The first is using the well-known unit quaternion as state (q-IEKF) while the other is using orientation deviation which we call IMEKF. The third method is based on nonlinear least squares (NLS) estimation of the angular velocity which is used to parametrise the orientation. The results are obtained using Monte Carlo simulations and the comparison is done with the non-iterative EKF and multiplicative EKF (MEKF) as baseline. The result clearly shows that the IMEKF and the NLS-based method are superior to q-IEKF and all three outperform the non-iterative methods.} } @InProceedings{SneGha05, cat = {approx}, author = {Edward Snelson and Zoubin Ghahramani}, title = {Compact Approximations to {B}ayesian Predictive Distributions}, booktitle = icml22, year = 2005, address = {Bonn, Germany}, month = {August}, URL = {http://www.gatsby.ucl.ac.uk/~snelson/Bayes_pred.pdf}, publisher = {Omnipress}, abstract = {We provide a general framework for learning precise, compact, and fast representations of the Bayesian predictive distribution for a model. This framework is based on minimizing the KL divergence between the true predictive density and a suitable compact approximation. We consider various methods for doing this, both sampling based approximations, and deterministic approximations such as expectation propagation. These methods are tested on a mixture of Gaussians model for density estimation and on binary linear classification, with both synthetic data sets for visualization and several real data sets. Our results show significant reductions in prediction time and memory footprint.} } @InCollection{SneGha06, cat = {gp}, title = {Sparse {G}aussian Processes using Pseudo-inputs}, author = {Edward Snelson and Zoubin Ghahramani}, booktitle = nips18, editor = {Y. Weiss and B. Sch\"{o}lkopf and J. Platt}, publisher = MIT, address = {Cambridge, MA}, pages = {1257--1264}, year = 2006, url = {http://www.gatsby.ucl.ac.uk/~snelson/SPGP_up.pdf}, abstract = {We present a new Gaussian process (GP) regression model whose covariance is parameterized by the the locations of M pseudo-input points, which we learn by a gradient based optimization. We take M<<N, where N is the number of real data points, and hence obtain a sparse regression method which has O(NM

Gaussian processes are capable of generalizing standard linear time series models. We cover two approaches: the Gaussian process time series model (GPTS) and the autoregressive Gaussian process (ARGP). We cover a variety of methods that greatly reduce the computational and memory complexity of Gaussian process approaches, which are generally cubic in computational complexity.

Two different improvements to state space based approaches are covered. First, Gaussian process inference and learning (GPIL) generalizes linear dynamical systems (LDS), for which the Kalman filter is based, to general nonlinear systems for nonparametric system identification. Second, we address pathologies in the unscented Kalman filter (UKF). We use Gaussian process optimization (GPO) to learn UKF settings that minimize the potential for sigma point collapse.

We show how to embed mentioned Gaussian process approaches to time series into a change point framework. Old data, from an old regime, that hinders predictive performance is automatically and elegantly phased out. The computational improvements for Gaussian process time series approaches are of even greater use in the change point framework. We also present a supervised framework learning a change point model when change point labels are available in training.

These mentioned methodologies significantly improve predictive performance on the diverse set of data sets selected.} } @inproceedings{TurBotGha10, cat = {time}, author = {Ryan Turner and Steven Bottone and Zoubin Ghahramani}, title = {Fast Online Anomaly Detection Using Scan Statistics}, booktitle = {Machine Learning for Signal Processing (MLSP 2010)}, year = 2010, address = {Kittil\"{a}, Finland}, month = {August}, pages = {385--390}, isbn = {978-1-4244-7876-7}, editor = {Samuel Kaski and David J. Miller and Erkki Oja and Antti Honkela}, abstract = {We present methods to do fast online anomaly detection using scan statistics. Scan statistics have long been used to detect statistically significant bursts of events. We extend the scan statistics framework to handle many practical issues that occur in application: dealing with an unknown background rate of events, allowing for slow natural changes in background frequency, the inverse problem of finding an unusual lack of events, and setting the test parameters to maximize power. We demonstrate its use on real and synthetic data sets with comparison to other methods.}, url = {.} } @inproceedings{TurDeiRas09, cat = {gp time}, author = {Ryan Turner and Marc Peter Deisenroth and Carl Edward Rasmussen}, title = {System Identification in {G}aussian Process Dynamical Systems}, booktitle = {NIPS Workshop on Nonparametric Bayes}, year = 2009, editor = {Dilan G\"or\"ur}, address = {Whistler, BC, Canada}, month = {December}, url = {.}, annote = {poster.} } @inproceedings{TurDeiRas10, cat = {gp time}, author = {Ryan Turner and Marc Peter Deisenroth and Carl Edward Rasmussen}, title = {State-Space Inference and Learning with {G}aussian Processes}, booktitle = aistats13, year = 2010, editor = {Yee Whye Teh and Mike Titterington}, volume = 9, series = {W \& CP}, pages = {868--875}, address = {Chia Laguna, Sardinia, Italy}, month = {May 13--15}, organization = jmlr, abstract = {State-space inference and learning with Gaussian processes (GPs) is an unsolved problem. We propose a new, general methodology for inference and learning in nonlinear state-space models that are described probabilistically by non-parametric GP models. We apply the expectation maximization algorithm to iterate between inference in the latent state-space and learning the parameters of the underlying GP dynamics model.}, url = {.}, annote = {poster.} } @inproceedings{TurRas10, cat = {gp time}, author = {Ryan Turner and Carl Edward Rasmussen}, title = {Model Based Learning of Sigma Points in Unscented {K}alman Filtering}, booktitle = {Machine Learning for Signal Processing (MLSP 2010)}, year = 2010, address = {Kittil\"{a}, Finland}, pages = {178--183}, month = {August}, isbn = {978-1-4244-7876-7}, editor = {Samuel Kaski and David J. Miller and Erkki Oja and Antti Honkela}, abstract = {The unscented Kalman filter (UKF) is a widely used method in control and time series applications. The UKF suffers from arbitrary parameters necessary for a step known as sigma point placement, causing it to perform poorly in nonlinear problems. We show how to treat sigma point placement in a UKF as a learning problem in a model based view. We demonstrate that learning to place the sigma points correctly from data can make sigma point collapse much less likely. Learning can result in a significant increase in predictive performance over default settings of the parameters in the UKF and other filters designed to avoid the problems of the UKF, such as the GP-ADF. At the same time, we maintain a lower computational complexity than the other methods. We call our method UKF-L.}, url = {.} } @article{TurRas12, cat = {gp time}, author = {Ryan D.~Turner and Carl Edward Rasmussen}, volume = 80, title = {Model based learning of sigma points in unscented {K}alman filtering}, journal = {Neurocomputing}, pages = {47--53}, year = 2012, url = {.}, doi = {10.1016/j.neucom.2011.07.029}, abstract = {The unscented Kalman filter (UKF) is a widely used method in control and time series applications. The UKF suffers from arbitrary parameters necessary for sigma point placement, potentially causing it to perform poorly in nonlinear problems. We show how to treat sigma point placement in a UKF as a learning problem in a model based view. We demonstrate that learning to place the sigma points correctly from data can make sigma point collapse much less likely. Learning can result in a significant increase in predictive performance over default settings of the parameters in the UKF and other filters designed to avoid the problems of the UKF, such as the GP-ADF. At the same time, we maintain a lower computational complexity than the other methods. We call our method UKF-L.} } @inproceedings{TurSaaRas09, cat = {time}, author = {Ryan Turner and Yunus Saat\c{c}i and Carl Edward Rasmussen}, title = {Adaptive Sequential {B}ayesian Change Point Detection}, booktitle = {NIPS Workshop on Temporal Segmentation}, year = 2009, editor = {Za\"{i}d Harchaoui}, address = {Whistler, BC, Canada}, month = {December}, abstract = {Real-world time series are often nonstationary with respect to the parameters of some underlying prediction model (UPM). Furthermore, it is often desirable to adapt the UPM to incoming regime changes as soon as possible, necessitating sequential inference about change point locations. A Bayesian algorithm for online change point detection (BOCPD) has been introduced recently by Adams and MacKay (2007). In this algorithm, uncertainty about the last change point location is updated sequentially, and is integrated out to make online predictions robust to parameter changes. BOCPD requires a set of fixed hyper-parameters which allow the user to fully specify the hazard function for change points and the prior distribution over the parameters of the UPM. In practice, finding the "right" hyper-parameters can be quite difficult. We therefore extend BOCPD by introducing hyper-parameter learning, without sacrificing the online nature of the algorithm. Hyper-parameter learning is performed by optimizing the marginal likelihood of the BOCPD model, a closed-form quantity which can be computed sequentially. We illustrate performance on three real-world datasets.}, url = {.}, annote = {poster, slides.} } @inproceedings{TurSah07b, cat = {time approx mhearing sigproc}, author = {Richard E. Turner and M Sahani}, title = {Probabilistic Amplitude Demodulation}, booktitle = {7th International Conference on Independent Component Analysis and Signal Separation}, year = 2007, pages = {544--551}, abstract = {Auditory scene analysis is extremely challenging. One approach, perhaps that adopted by the brain, is to shape useful representations of sounds on prior knowledge about their statistical structure. For example, sounds with harmonic sections are common and so time-frequency representations are efficient. Most current representations concentrate on the shorter components. Here, we propose representations for structures on longer time-scales, like the phonemes and sentences of speech. We decompose a sound into a product of processes, each with its own characteristic time-scale. This demodulation cascade relates to classical amplitude demodulation, but traditional algorithms fail to realise the representation fully. A new approach, probabilistic amplitude demodulation, is shown to out-perform the established methods, and to easily extend to representation of a full demodulation cascade.}, url = {http://www.gatsby.ucl.ac.uk/~turner/Publications/turner-and-sahani-2007b.html}, } @inproceedings{TurSah08a, cat = {time approx sigproc mhearing}, author = {Richard E. Turner and Maneesh Sahani}, title = {Modeling natural sounds with modulation cascade processes}, booktitle = {nips20}, year = 2008, editor = {J. C. Platt and D. Koller and Y. Singer and S. Roweis}, volume = 20, publisher = {mit}, abstract = {Natural sounds are structured on many time-scales. A typical segment of speech, for example, contains features that span four orders of magnitude: Sentences ($\sim1$s); phonemes ($\sim10$−$1$ s); glottal pulses ($\sim 10$−$2$s); and formants ($\sim 10$−$3$s). The auditory system uses information from each of these time-scales to solve complicated tasks such as auditory scene analysis [1]. One route toward understanding how auditory processing accomplishes this analysis is to build neuroscience-inspired algorithms which solve similar tasks and to compare the properties of these algorithms with properties of auditory processing. There is however a discord: Current machine-audition algorithms largely concentrate on the shorter time-scale structures in sounds, and the longer structures are ignored. The reason for this is two-fold. Firstly, it is a difficult technical problem to construct an algorithm that utilises both sorts of information. Secondly, it is computationally demanding to simultaneously process data both at high resolution (to extract short temporal information) and for long duration (to extract long temporal information). The contribution of this work is to develop a new statistical model for natural sounds that captures structure across a wide range of time-scales, and to provide efficient learning and inference algorithms. We demonstrate the success of this approach on a missing data task.}, url = {http://www.gatsby.ucl.ac.uk/~turner/Publications/turner-and-sahani-2008a.html}, } @InProceedings{TurSah10a, cat = {approx time sigproc mhearing}, author = {Richard E. Turner and Maneesh Sahani}, title = {Statistical inference for single- and multi-band probabilistic amplitude demodulation.}, booktitle = {Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)}, pages = {5466--5469 }, year = 2010, abstract = {Amplitude demodulation is an ill-posed problem and so it is natural to treat it from a Bayesian viewpoint, inferring the most likely carrier and envelope under probabilistic constraints. One such treatment is Probabilistic Amplitude Demodulation (PAD), which, whilst computationally more intensive than traditional approaches, offers several advantages. Here we provide methods for estimating the uncertainty in the PAD-derived envelopes and carriers, and for learning free-parameters like the time-scale of the envelope. We show how the probabilistic approach can naturally handle noisy and missing data. Finally, we indicate how to extend the model to signals which contain multiple modulators and carriers.}, url = {http://www.gatsby.ucl.ac.uk/~turner/Publications/turner-and-sahani-2010a.html}, } @incollection{TurSah11a, cat = {approx time review}, author = {Richard E. Turner and Maneesh Sahani}, title = {Two problems with variational expectation maximisation for time-series models}, booktitle = {Bayesian Time series models}, pages = {109--130}, publisher = {Cambridge University Press}, year = 2011, editor = {D. Barber and T. Cemgil and S. Chiappa}, chapter = 5, abstract = {Variational methods are a key component of the approximate inference and learning toolbox. These methods fill an important middle ground, retaining distributional information about uncertainty in latent variables, unlike maximum a posteriori methods (MAP), and yet generally requiring less computational time than Monte Carlo Markov Chain methods. In particular the variational Expectation Maximisation (vEM) and variational Bayes algorithms, both involving variational optimisation of a free-energy, are widely used in time-series modelling. Here, we investigate the success of vEM in simple probabilistic time-series models. First we consider the inference step of vEM, and show that a consequence of the well-known compactness property of variational inference is a failure to propagate uncertainty in time, thus limiting the usefulness of the retained distributional information. In particular, the uncertainty may appear to be smallest precisely when the approximation is poorest. Second, we consider parameter learning and analytically reveal systematic biases in the parameters found by vEM. Surprisingly, simpler variational approximations (such a mean-field) can lead to less bias than more complicated structured approximations.}, url = {http://www.gatsby.ucl.ac.uk/~turner/Publications/turner-and-sahani-2011a.html}, } @article{TurSah11b, cat = {gp time mhearing sigproc}, author = {Richard E. Turner and Maneesh Sahani}, title = {Demodulation as Probabilistic Inference}, journal = {Transactions on Audio, Speech and Language Processing}, year = 2011, volume = 19, issue = 8, pages = {2398--2411}, abstract = {Demodulation is an ill-posed problem whenever both carrier and envelope signals are broadband and unknown. Here, we approach this problem using the methods of probabilistic inference. The new approach, called Probabilistic Amplitude Demodulation (PAD), is computationally challenging but improves on existing methods in a number of ways. By contrast to previous approaches to demodulation, it satisfies five key desiderata: PAD has soft constraints because it is probabilistic; PAD is able to automatically adjust to the signal because it learns parameters; PAD is user-steerable because the solution can be shaped by user-specific prior information; PAD is robust to broad-band noise because this is modelled explicitly; and PAD’s solution is self-consistent, empirically satisfying a Carrier Identity property. Furthermore, the probabilistic view naturally encompasses noise and uncertainty, allowing PAD to cope with missing data and return error bars on carrier and envelope estimates. Finally, we show that when PAD is applied to a bandpass-filtered signal, the stop-band energy of the inferred carrier is minimal, making PAD well-suited to sub-band demodulation.}, url = {http://www.gatsby.ucl.ac.uk/~turner/Publications/turner-and-sahani-2011b.html}, } @incollection{TurSah11c, cat = {gp time mhearing sigproc}, title = {Probabilistic amplitude and frequency demodulation}, author = {Richard E. Turner and Maneesh Sahani}, booktitle = nips24, publisher = mit, pages = {981--989}, year = 2011, abstract = {A number of recent scientific and engineering problems require signals to be decomposed into a product of a slowly varying positive envelope and a quickly varying carrier whose instantaneous frequency also varies slowly over time. Although signal processing provides algorithms for so-called amplitude- and frequency-demodulation (AFD), there are well known problems with all of the existing methods. Motivated by the fact that AFD is ill-posed, we approach the problem using probabilistic inference. The new approach, called probabilistic amplitude and frequency demodulation (PAFD), models instantaneous frequency using an auto-regressive generalization of the von Mises distribution, and the envelopes using Gaussian auto-regressive dynamics with a positivity constraint. A novel form of expectation propagation is used for inference. We demonstrate that although PAFD is computationally demanding, it outperforms previous approaches on synthetic and real signals in clean, noisy and missing data settings.}, url = {http://www.gatsby.ucl.ac.uk/~turner/Publications/turner-and-sahani-2011c.html}, } @inproceedings{TurSah12a, cat = {sigproc mhearing sigproc}, author = {Richard E. Turner and Maneesh Sahani}, booktitle = {Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on}, abstract = {There are many methods for decomposing signals into a sum of amplitude and frequency modulated sinusoids. In this paper we take a new estimation based approach. Identifying the problem as ill-posed, we show how to regularize the solution by imposing soft constraints on the amplitude and phase variables of the sinusoids. Estimation proceeds using a version of Kalman smoothing. We evaluate the method on synthetic and natural, clean and noisy signals, showing that it outperforms previous decompositions, but at a higher computational cost.}, title = {Decomposing signals into a sum of amplitude and frequency modulated sinusoids using probabilistic inference}, year = 2012, month = {march}, pages = {2173--2176}, doi = {10.1109/ICASSP.2012.6288343}, ISSN = {1520-6149}, url = {http://www.gatsby.ucl.ac.uk/~turner/Publications/turner-and-sahani-2012a.html}, } @article{TurSah2007a, cat = {time mvision neuro}, author = {Richard E. Turner and Maneesh Sahani}, title = {A Maximum-Likelihood Interpretation for Slow Feature Analysis}, journal = {nc}, year = 2007, volume = 19, pages = {1022--1038}, number = 4, address = {Cambridge, MA, USA}, doi = {http://dx.doi.org/10.1162/neco.2007.19.4.1022}, issn = {0899-7667}, publisher = {MIT Press}, abstract = {The brain extracts useful features from a maelstrom of sensory information, and a fundamental goal of theoretical neuroscience is to work out how it does so. One proposed feature extraction strategy is motivated by the observation that the meaning of sensory data, such as the identity of a moving visual object, is often more persistent than the activation of any single sensory receptor. This notion is embodied in the slow feature analysis (SFA) algorithm, which uses “slowness” as an heuristic by which to extract semantic information from multi-dimensional time-series. Here, we develop a probabilistic interpretation of this algorithm showing that inference and learning in the limiting case of a suitable probabilistic model yield exactly the results of SFA. Similar equivalences have proved useful in interpreting and extending comparable algorithms such as independent component analysis. For SFA, we use the equivalent probabilistic model as a conceptual spring-board, with which to motivate several novel extensions to the algorithm.}, url = {http://www.gatsby.ucl.ac.uk/~turner/Publications/turner-and-sahani-2007a.html}, } @article{UedGha02a, author = {Ueda, Naonori and Ghahramani, Zoubin}, journal = {Neural Networks}, number = 10, pages = {1223-1241}, title = {Bayesian model search for mixture models based on optimizing variational bounds}, url = {.}, volume = 15, year = 2002, abstract = {When learning a mixture model, we suffer from the local optima and model structure determination problems. In this paper, we present a method for simultaneously solving these problems based on the variational Bayesian (VB) framework. First, in the VB framework, we derive an objective function that can simultaneously optimize both model parameter distributions and model structure. Next, focusing on mixture models, we present a deterministic algorithm to approximately optimize the objective function by using the idea of the split and merge operations which we previously proposed within the maximum likelihood framework. Then, we apply the method to mixture of expers (MoE) models to experimentally show that the proposed method can find the optimal number of experts of a MoE while avoiding local maxima. q 2002 Elsevier Science Ltd. All rights reserved. } } @article{UedNakGha00a, author = {Ueda, Naonori and Nakano, Ryohei and Ghahramani, Zoubin and Hinton, Geoffrey E.}, cat = {clust}, journal = {Neural Computation}, number = 9, pages = {2109-2128}, title = {{SMEM} Algorithm for Mixture Models}, url = {.}, volume = 12, year = 2000, abstract = {We present a split-and-merge expectation-maximization (SMEM) algorithm to overcome the local maxima problem in parameter estimation of finite mixture models. In the case of mixture models, local maxima often involve having too many components of a mixture model in one part of the space and too few in another, widely separated part of the space. To escape from such configurations, we repeatedly perform simultaneous split-and-merge operations using a new criterion for efficiently selecting the split-and-merge candidates. We apply the proposed algorithm to the training of gaussian mixtures and mixtures of factor analyzers using synthetic and real data and show the effectiveness of using the split-and-merge operations to improve the likelihood of both the training data and of held-out test data. We also show the practical usefulness of the proposed algorithm by applying it to image compression and pattern recognition problems. } } @article{UedNakGha00b, author = {Ueda, Naonori and Nakano, Ryohei and Ghahramani, Zoubin and Hinton, Geoffrey E.}, cat = {clust}, journal = {VLSI Signal Processing}, number = {1-2}, pages = {133-140}, title = {Split and Merge {EM} Algorithm for Improving {G}aussian Mixture Density Estimates}, url = {.}, volume = 26, year = 2000, abstract = {We present a split and merge EM algorithm to overcome the local maximum problem in Gaussian mixture density estimation. Nonglobal maxims often involve having too many Gaussians in one part of the space and too few in another, widely separated part of the space. To escape from such configurations we repeatedly perform split and merge operations using a new criterion for efficiently selecting the split and merge candidates. Experimental results on synthetic and real data show the effectiveness of using the split and merge operations to improve the likelihood of both the training data and of held-out test data } } @inproceedings{UedNakGha98a, author = {Ueda, Naonori and Nakano, Ryohei and Ghahramani, Zoubin and Hinton, Geoffrey E.}, booktitle = {NIPS}, cat = {clust}, editor = {Kearns, Michael J. and Solla, Sara A. and Cohn, David A.}, isbn = {0-262-11245-0}, pages = {599-605}, publisher = {The MIT Press}, title = {{SMEM} Algorithm for Mixture Models}, url = {.}, year = 1998, abstract = {We present a split-and-merge expectation-maximization (SMEM) algorithm to overcome the local maxima problem in parameter estimation of finite mixture models. In the case of mixture models, local maxima often involve having too many components of a mixture model in one part of the space and too few in another, widely separated part of the space. To escape from such configurations, we repeatedly perform simultaneous split-and-merge operations using a new criterion for efficiently selecting the split-and-merge candidates. We apply the proposed algorithm to the training of gaussian mixtures and mixtures of factor analyzers using synthetic and real data and show the effectiveness of using the split- and-merge operations to improve the likelihood of both the training data and of held-out test data. We also show the practical usefulness of the proposed algorithm by applying it to image compression and pattern recognition problems. } } @inproceedings{VanSaaTehGha08, cat = {np mcmc time}, abstract = {The infinite hidden Markov model is a non-parametric extension of the widely used hidden Markov model. Our paper introduces a new inference algorithm for the infinite Hidden Markov model called beam sampling. Beam sampling combines slice sampling, which limits the number of states considered at each time step to a finite number, with dynamic programming, which samples whole state trajectories efficiently. Our algorithm typically outperforms the Gibbs sampler and is more robust. We present applications of iHMM inference using the beam sampler on changepoint detection and text prediction problems.}, address = {Helsinki, Finland}, author = {Jurgen {Van Gael} and Yunus Saat\c{c}i and Yee-Whye Teh and Zoubin Ghahramani}, booktitle = icml25, pages = {1088--1095}, publisher = acm, title = {Beam sampling for the infinite hidden {M}arkov model}, url = {.}, volume = 25, year = 2008 } @inproceedings{VanTehGha08, cat = {np time}, author = {{Van Gael}, J. and Teh, Y.W. and Ghahramani, Z.}, booktitle = nips21, editor = {Koller, D. and Schuurmans, D. and Bottou, L. and Bengio, Y.}, pages = {1697--1704}, address = {Cambridge, MA, USA}, publisher = mit, title = {The infinite factorial hidden {M}arkov model}, url = {.}, volume = 21, year = 2008, month = {December}, abstract = {The infinite factorial hidden Markov model is a non-parametric extension of the factorial hidden Markov model. Our model defines a probability distribution over an infinite number of independent binary hidden Markov chains which together produce an observable sequence of random variables. Central to our model is a new type of non-parametric prior distribution inspired by the Indian Buffet Process which we call the \emph{Indian Buffet Markov Process}.} } @inproceedings{VanVlaGha09, cat = {np,nlp}, author = {{Van Gael}, J. and Vlachos, A. and Ghahramani, Z.}, booktitle = {Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP)}, title = {The infinite {HMM} for unsupervised {PoS} tagging}, pages = {678--687}, isbn = {978-1-932432-62-6}, url = {.}, year = 2009, month = {August}, publisher = {Association for Computational Linguistics}, address = {Singapore}, abstract = {We extend previous work on fully unsupervised part-of-speech tagging. Using a non-parametric version of the HMM, called the infinite HMM (iHMM), we address the problem of choosing the number of hidden states in unsupervised Markov models for PoS tagging. We experiment with two non-parametric priors, the Dirichlet and Pitman-Yor processes, on the Wall Street Journal dataset using a parallelized implementation of an iHMM inference algorithm. We evaluate the results with a variety of clustering evaluation metrics and achieve equivalent or better performances than previously reported. Building on this promising result we evaluate the output of the unsupervised PoS tagger as a direct replacement for the output of a fully supervised PoS tagger for the task of shallow parsing and compare the two evaluations.} } @inproceedings{VanZhu07, author = {{Van Gael}, J. and Zhu, X.}, booktitle = ijcai, title = {Correlation clustering for crosslingual link detection}, url = {.}, pages = {1744--1749}, year = 2007, month = {January}, address = {Hyderabad, India}, editor = {Manuela M. Veloso}, abstract = {The crosslingual link detection problem calls for identifying news articles in multiple languages that report on the same news event. This paper presents a novel approach based on constrained clustering. We discuss a general way for constrained clustering using a recent, graph-based clustering framework called correlation clustering. We introduce a correlation clustering implementation that features linear program chunking to allow processing larger datasets. We show how to apply the correlation clustering algorithm to the crosslingual link detection problem and present experimental results that show correlation clustering improves upon the hierarchical clustering approaches commonly used in link detection, and, hierarchical clustering approaches that take constraints into account.} } @inproceedings{VlaAndKor08a, title = {Dirichlet process mixture models for verb clustering}, author = {Vlachos, Andreas and Ghahramani, Zoubin and Korhonen, Anna}, booktitle = {Proceedings of the ICML workshop on Prior Knowledge for Text and Language}, year = 2008, url = {.}, abstract = {In this work we apply Dirichlet Process Mixture Models to a learning task in natural language processing (NLP): lexical-semantic verb clustering. We assess the performance on a dataset based on Levin’s (1993) verb classes using the recently introduced V- measure metric. In, we present a method to add human supervision to the model in order to to influence the solution with respect to some prior knowledge. The quantitative evaluation performed highlights the benefits of the chosen method compared to previously used clustering approaches.} } @inproceedings{VlaAndKor09a, title = {Unsupervised and constrained {D}irichlet process mixture models for verb clustering}, author = {Vlachos, Andreas and Korhonen, Anna and Ghahramani, Zoubin}, booktitle = {Proceedings of the workshop on geometrical models of natural language semantics}, pages = {74--82}, year = 2009, cat = {nlp}, organization = {Association for Computational Linguistics}, abstract = {In this work, we apply Dirichlet Process Mixture Models (DPMMs) to a learning task in natural language processing (NLP): lexical-semantic verb clustering. We thoroughly evaluate a method of guiding DP- MMs towards a particular clustering solution using pairwise constraints. The quantitative and qualitative evaluation per- formed highlights the benefits of both standard and constrained DPMMs com- pared to previously used approaches. In addition, it sheds light on the use of evaluation measures and their practical application.}, url = {.} } @inproceedings{VlaGhaBri10, cat = {clust}, author = {Andreas Vlachos and Zoubin Ghahramani and Ted Briscoe}, year = 2010, title = {Active Learning for Constrained {Dirichlet} Process Mixture Models}, booktitle = {Proceedings of the 2010 Workshop on Geometrical Models of Natural Language Semantics}, pages = {57--61}, address = {Uppsala, Sweden}, url = {.}, abstract = {Recent work applied Dirichlet Process Mixture Models to the task of verb clustering, incorporating supervision in the form of must-links and cannot-links constraints between instances. In this work, we introduce an active learning approach for constraint selection employing uncertainty-based sampling. We achieve substantial improvements over random selection on two datasets.} } @inproceedings{VlaGhaKor08, cat = {clust}, booktitle = {ICML Workshop on Prior Knowledge for Text and Language Processing}, title = {{D}irichlet process mixture models for verb clustering}, author = {A. Vlachos and Z. Ghahramani and A Korhonen}, year = 2008, month = {July}, address = {Helsinki, Finland}, pages = {43--48}, editor = {Guillaume Bouchard and Hal {Daum\'{e} III} and Marc Dymetman and Yee Whye Teh}, url = {.}, abstract = {In this work we apply Dirichlet Process Mixture Models to a learning task in natural language processing (NLP): lexical-semantic verb clustering. We assess the performance on a dataset based on Levin's (1993) verb classes using the recently introduced V-measure metric. In, we present a method to add human supervision to the model in order to to influence the solution with respect to some prior knowledge. The quantitative evaluation performed highlights the benefits of the chosen method compared to previously used clustering approaches.} } @inproceedings{VlaKorGha09, cat = {clust np nlp}, booktitle = {4th Workshop on Statistical Machine Translation, EACL '09}, title = {Unsupervised and constrained {Dirichlet} process mixture models for verb clustering}, author = {A. Vlachos and A Korhonen and Z. Ghahramani}, year = 2009, month = {March}, address = {Athens, Greece}, url = {.}, abstract = {In this work, we apply Dirichlet Process Mixture Models (DPMMs) to a learning task in natural language processing (NLP): lexical-semantic verb clustering. We thoroughly evaluate a method of guiding DPMMs towards a particular clustering solution using pairwise constraints. The quantitative and qualitative evaluation performed highlights the benefits of both standard and constrained DPMMs compared to previously used approaches. In addition, it sheds light on the use of evaluation measures and their practical application.} } @inproceedings{Wel15, cat = {gm}, title = {Revisiting the limits of {MAP} inference by {MWSS} on perfect graphs}, booktitle = aistats18, year = 2015, month = {May}, address = {San Diego, California}, author = {Adrian Weller}, pages = {1061--1069}, abstract = {A recent, promising approach to identifying a configuration of a discrete graphical model with highest probability (termed MAP inference) is to reduce the problem to finding a maximum weight stable set (MWSS) in a derived weighted graph, which, if perfect, allows a solution to be found in polynomial time. Weller and Jebara (2013) investigated the class of binary pairwise models where this method may be applied. However, their analysis made a seemingly innocuous assumption which simplifies analysis but led to only a subset of possible reparameterizations being considered. Here we introduce novel techniques and consider all cases, demonstrating that this greatly expands the set of tractable models. We provide a simple, exact characterization of the new, enlarged set and show how such models may be efficiently identified, thus settling the power of the approach on this class.}, url = {http://jmlr.org/proceedings/papers/v38/weller15.pdf}, annote = {Also accepted for presentation at the 21st International Conference on Principles and Practice of Constraint Programming (CP 2015)} } @inproceedings{Wel15b, cat = {approx gm}, title = {{B}ethe and Related Pairwise Entropy Approximations}, booktitle = uai31, year = 2015, month = {July}, address = {Amsterdam}, author = {Adrian }, pages = {942--951}, abstract = {For undirected graphical models, belief propagation often performs remarkably well for approximate marginal inference, and may be viewed as a heuristic to minimize the Bethe free energy. Focusing on binary pairwise models, we demonstrate that several recent results on the Bethe approximation may be generalized to a broad family of related pairwise free energy approximations with arbitrary counting numbers. We explore approximation error and shed light on the empirical success of the Bethe approximation.}, url = {http://auai.org/uai2015/proceedings/papers/9.pdf}, annote = {Supplementary Material} } @inproceedings{Wel16, cat = {gm}, title = {Uprooting and Rerooting Graphical Models}, booktitle = icml33, year = 2016, month = {June}, address = {New York, NY}, url = {http://mlg.eng.cam.ac.uk/adrian/Wel16_Uproot.pdf}, author = {Adrian Weller}, abstract = {We show how any binary pairwise model may be ‘uprooted’ to a fully symmetric model, wherein original singleton potentials are transformed to potentials on edges to an added variable, and then ‘rerooted’ to a new model on the original number of variables. The new model is essentially equivalent to the original model, with the same partition function and allowing recovery of the original marginals or a MAP configuration, yet may have very different computational properties that allow much more efficient inference. This meta-approach deepens our understanding, may be applied to any existing algorithm to yield improved methods in practice, generalizes earlier theoretical results, and reveals a remarkable interpretation of the triplet-consistent polytope.} } @inproceedings{Wel16b, cat = {gm}, title = {Characterizing Tightness of {LP} Relaxations by Forbidding Signed Minors}, booktitle = uai32, year = 2016, month = {June}, address = {Jersey City, NJ}, url = {http://www.auai.org/uai2016/proceedings/papers/1.pdf}, author = {Adrian Weller}, abstract = {We consider binary pairwise graphical models and provide an exact characterization (necessary and sufficient conditions observing signs of potentials) of tightness for the LP relaxation on the triplet-consistent polytope of the MAP inference problem, by forbidding an odd-K5 (complete graph on 5 variables with all edges repulsive) as a signed minor in the signed suspension graph. This captures signs of both singleton and edge potentials in a compact and efficiently testable condition, and improves significantly on earlier results. We provide other results on tightness of LP relaxations by forbidding minors, draw connections and suggest paths for future research.}, annote = {Supplementary Material} } @inproceedings{WelDom16, cat = {gm}, title = {Clamping Improves {TRW} and Mean Field Approximations}, booktitle = aistats19, year = 2016, month = {May}, address = {Cadiz, Spain}, author = {Adrian Weller and Justin Domke}, abstract = {We examine the effect of clamping variables for approximate inference in undirected graphical models with pairwise relationships and discrete variables. For any number of variable labels, we demonstrate that clamping and summing approximate sub-partition functions can lead only to a decrease in the partition function estimate for TRW, and an increase for the naive mean field method, in each case guaranteeing an improvement in the approximation and bound. We next focus on binary variables, add the Bethe approximation to consideration and examine ways to choose good variables to clamp, introducing new methods. We show the importance of identifying highly frustrated cycles, and of checking the singleton entropy of a variable. We explore the value of our methods by empirical analysis and draw lessons to guide practitioners.}, url = {http://mlg.eng.cam.ac.uk/adrian/clamp_aistats_final.pdf} } @incollection{WelJeb14, cat = {approx gm}, title = {Clamping Variables and Approximate Inference}, author = {Weller, Adrian and Jebara, Tony}, booktitle = {Advances in Neural Information Processing Systems 28}, editor = {Z. Ghahramani and M. Welling and C. Cortes and N.D. Lawrence and K.Q. Weinberger}, pages = {909--917}, year = 2014, publisher = {Curran Associates, Inc.}, url = {http://papers.nips.cc/paper/5529-clamping-variables-and-approximate-inference.pdf}, annote = {Supplementary Material}, abstract = { It was recently proved using graph covers (Ruozzi, 2012) that the Bethe partition function is upper bounded by the true partition function for a binary pairwise model that is attractive. Here we provide a new, arguably simpler proof from first principles. We make use of the idea of clamping a variable to a particular value. For an attractive model, we show that summing over the Bethe partition functions for each sub-model obtained after clamping any variable can only raise (and hence improve) the approximation. In fact, we derive a stronger result that may have other useful implications. Repeatedly clamping until we obtain a model with no cycles, where the Bethe approximation is exact, yields the result. We also provide a related lower bound on a broad class of approximate partition functions of general pairwise multi-label models that depends only on the topology. We demonstrate that clamping a few wisely chosen variables can be of practical value by dramatically reducing approximation error. } } @inproceedings{WelRowSon16, cat = {gm}, title = {Tightness of {LP} Relaxations for Almost Balanced Models}, booktitle = aistats19, year = 2016, month = {May}, address = {Cadiz, Spain}, url = {http://mlg.eng.cam.ac.uk/adrian/tricam.pdf}, author = {Adrian Weller and Mark Rowland and David Sontag}, abstract = {Linear programming (LP) relaxations are widely used to attempt to identify a most likely configuration of a discrete graphical model. In some cases, the LP relaxation attains an optimum vertex at an integral location and thus guarantees an exact solution to the original optimization problem. When this occurs, we say that the LP relaxation is tight. Here we consider binary pairwise models and derive sufﬁcient conditions for guaranteed tightness of (i) the standard LP relaxation on the local polytope LP+LOC, and (ii) the LP relaxation on the triplet-consistent polytope LP+TRI (the next level in the Sherali-Adams hierarchy). We provide simple new proofs of earlier results and derive signiﬁcant novel results including that LP+TRI is tight for any model where each block is balanced or almost balanced, and a decomposition theorem that may be used to break apart complex models into smaller pieces. An almost balanced (sub-)model is one that contains no frustrated cycles except through one privileged variable.} } @inproceedings{RowWel17, cat = {gm}, title = {Uprooting and rerooting higher-order graphical models}, booktitle = nips31, year = 2017, month = {December}, address = {Long Beach, California}, author = {Mark Rowland and Adrian Weller}, abstract = {The idea of uprooting and rerooting graphical models was introduced specifically for binary pairwise models by Weller [18] as a way to transform a model to any of a whole equivalence class of related models, such that inference on any one model yields inference results for all others. This is very helpful since inference, or relevant bounds, may be much easier to obtain or more accurate for some model in the class. Here we introduce methods to extend the approach to models with higher-order potentials and develop theoretical insights. For example, we demonstrate that the triplet-consistent polytope TRI is unique in being 'universally rooted'. We demonstrate empirically that rerooting can significantly improve accuracy of methods of inference for higher-order models at negligible computational cost.} } @phdthesis{Wil14, cat = {gp np time}, author = {Andrew Gordon Wilson}, title = {Covariance Kernels for Fast Automatic Pattern Discovery and Extrapolation with {G}aussian Processes}, year = 2014, url = {http://mlg.eng.cam.ac.uk/andrew/andrewgwthesis.pdf}, school = {University of Cambridge}, address = {Cambridge, UK}, abstract = {Truly intelligent systems are capable of pattern discovery and extrapolation without human intervention. Bayesian nonparametric models, which can uniquely represent expressive prior information and detailed inductive biases, provide a distinct opportunity to develop intelligent systems, with applications in essentially any learning and prediction task.

Gaussian processes are rich distributions over functions, which provide a Bayesian nonparametric approach to smoothing and interpolation. A covariance kernel determines the support and inductive biases of a Gaussian process. In this thesis, we introduce new covariance kernels to enable fast automatic pattern discovery and extrapolation with Gaussian processes.

In the introductory chapter, we discuss the high level principles behind all of the models in this thesis: 1) we can typically improve the predictive performance of a model by accounting for additional structure in data; 2) to automatically discover rich structure in data, a model must have large support and the appropriate inductive biases; 3) we most need expressive models for large datasets, which typically provide more information for learning structure, and 4) we can often exploit the existing inductive biases (assumptions) or structure of a model for scalable inference, without the need for simplifying assumptions.

In the context of this introduction, we then discuss, in chapter 2, Gaussian processes as kernel machines, and my views on the future of Gaussian process research.

In chapter 3 we introduce the Gaussian process regression network (GPRN) framework, a multi-output Gaussian process method which scales to many output variables, and accounts for input-dependent correlations between the outputs. Underlying the GPRN is a highly expressive kernel, formed using an adaptive mixture of latent basis functions in a neural network like architecture. The GPRN is capable of discovering expressive structure in data. We use the GPRN to model the time-varying expression levels of 1000 genes, the spatially varying concentrations of several distinct heavy metals, and multivariate volatility (input dependent noise covariances) between returns on equity indices and currency exchanges, which is particularly valuable for portfolio allocation. We generalise the GPRN to an adaptive network framework, which does not depend on Gaussian processes or Bayesian nonparametrics; and we outline applications for the adaptive network in nuclear magnetic resonance (NMR) spectroscopy, ensemble learning, and change-point modelling.

In chapter 4 we introduce simple closed form kernel for automatic pattern discovery and extrapolation. These spectral mixture (SM) kernels are derived by modelling the spectral densiy of a kernel (its Fourier transform) using a scale-location Gaussian mixture. SM kernels form a basis for all stationary covariances, and can be used as a drop-in replacement for standard kernels, as they retain simple and exact learning and inference procedures. We use the SM kernel to discover patterns and perform long range extrapolation on atmospheric CO2 trends and airline passenger data, as well as on synthetic examples. We also show that the SM kernel can be used to automatically reconstruct several standard covariances. The SM kernel and the GPRN are highly complementary; we show that using the SM kernel with adaptive basis functions in a GPRN induces an expressive prior over non-stationary kernels.

In chapter 5 we introduce GPatt, a method for fast multidimensional pattern extrapolation, particularly suited to imge and movie data. Without human intervention -- no hand crafting of kernel features, and no sophisticated initialisation procedures -- we show that GPatt can solve large scale pattern extrapolation, inpainting and kernel discovery problems, including a problem with 383,400 training points. GPatt exploits the structure of a spectral mixture product (SMP) kernel, for fast yet exact inference procedures. We find that GPatt significantly outperforms popular alternative scalable gaussian process methods in speed and accuracy. Moreover, we discover profound differences between each of these methods, suggesting expressive kernels, nonparametric representations, and scalable inference which exploits existing model structure are useful in combination for modelling large scale multidimensional patterns.

The models in this dissertation have proven to be scalable and with greatly enhanced predictive performance over the alternatives: the extra structure being modelled is an important part of a wide variety of real data -- including problems in econometrics, gene expression, geostatistics, nuclear magnetic resonance spectroscopy, ensemble learning, multi-output regression, change point modelling, time series, multivariate volatility, image inpainting, texture extrapolation, video extrapolation, acoustic modelling, and kernel discovery.} } @inproceedings{WilAda13, cat = {gp np time}, author = {Andrew Gordon Wilson and Ryan Prescott Adams}, title = {Gaussian Process Kernels for Pattern Discovery and Extrapolation}, booktitle = icml30, abstract = {Gaussian processes are rich distributions over functions, which provide a Bayesian nonparametric approach to smoothing and interpolation. We introduce simple closed form kernels that can be used with Gaussian processes to discover patterns and enable extrapolation. These kernels are derived by modelling a spectral density -- the Fourier transform of a kernel -- with a Gaussian mixture. The proposed kernels support a broad class of stationary covariances, but Gaussian process inference remains simple and analytic. We demonstrate the proposed kernels by discovering patterns and performing long range extrapolation on synthetic examples, as well as atmospheric CO2 trends and airline passenger data. We also show that we can reconstruct standard covariances within our framework.}, month = {February 18}, year = 2013, url = {http://jmlr.org/proceedings/papers/v28/wilson13.pdf}, annote = {arXiv:1302.4245} } @inproceedings{WilGha08, author = {Sinead Williamson and Zoubin Ghahramani}, year = 2008, title = {Probabilistic Models for Data Combination in Recommender Systems}, booktitle = {Learning from Multiple Sources Workshop, NIPS Conference}, address = {Whistler Canada}, url = {.} } @inproceedings{WilGha10, cat = {np gp time}, author = {Andrew Gordon Wilson and Zoubin Ghahramani}, title = {Copula Processes}, booktitle = nips23, year = 2010, abstract = {We define a copula process which describes the dependencies between arbitrarily many random variables independently of their marginal distributions. As an example, we develop a stochastic volatility model, Gaussian Copula Process Volatility (GCPV), to predict the latent standard deviations of a sequence of random variables. To make predictions we use Bayesian inference, with the Laplace approximation, and with Markov chain Monte Carlo as an alternative. We find our model can outperform GARCH on simulated and financial data. And unlike GARCH, GCPV can easily handle missing data, incorporate covariates other than time, and model a rich class of covariance structures. }, url = {http://books.nips.cc/papers/files/nips23/NIPS2010_0784.pdf}, annote = {Supplementary Material, slides.}, note = {Spotlight} } @inproceedings{WilGha11, cat = {np gp time}, author = {Andrew Gordon Wilson and Zoubin Ghahramani}, title = {Generalised {W}ishart Processes}, booktitle = uai27, year = 2011, abstract = {We introduce a new stochastic process called the generalised Wishart process (GWP). It is a collection of positive semi-definite random matrices indexed by any arbitrary input variable. We use this process as a prior over dynamic (e.g. time varying) covariance matrices. The GWP captures a diverse class of covariance dynamics, naturally hanles missing data, scales nicely with dimension, has easily interpretable parameters, and can use input variables that include covariates other than time. We describe how to construct the GWP, introduce general procedures for inference and prediction, and show that it outperforms its main competitor, multivariate GARCH, even on financial data that especially suits GARCH. }, url = {http://uai.sis.pitt.edu/papers/11/p736-wilson.pdf}, annote = {Supplementary Material, Best Student Paper Award} } @inproceedings{WilGha12a, author = {Wilson, Andrew Gordon and Ghahramani, Zoubin}, booktitle = {ECML/PKDD}, cat = {gp}, editor = {Flach, Peter A. and Bie, Tijl De and Cristianini, Nello}, isbn = {978-3-642-33485-6}, pages = {858-861}, publisher = {Springer}, series = {Lecture Notes in Computer Science}, title = {Modelling Input Varying Correlations between Multiple Responses}, url = {.}, volume = 7524, year = 2012, abstract = {We introduced a generalised Wishart process (GWP) for modelling input dependent covariance matrices Σ(x), allowing one to model input varying correlations and uncertainties between multiple response variables. The GWP can naturally scale to thousands of response variables, as opposed to competing multivariate volatility models which are typically intractable for greater than 5 response variables. The GWP can also naturally capture a rich class of covariance dynamics -- periodicity, Brownian motion, smoothness, {\ldots}-- through a covariance kernel.} } @article{WilGilNehetal13, cat = {gp np time}, author = {Andrew Gordon Wilson and Elad Gilboa and Arye Nehorai and John P Cunningham}, title = {GPatt: {F}ast Multidimensional Pattern Extrapolation with {G}aussian Processes}, abstract = {Gaussian processes are typically used for smoothing and interpolation on small datasets. We introduce a new Bayesian nonparametric framework -- GPatt -- enabling automatic pattern extrapolation with Gaussian processes on large multidimensional datasets. GPatt unifies and extends highly expressive kernels and fast exact inference techniques. Without human intervention -- no hand crafting of kernel features, and no sophisticated initialisation procedures -- we show that GPatt can solve large scale pattern extrapolation, inpainting, and kernel discovery problems, including a problem with 383,400 training points. We find that GPatt significantly outperforms popular alternative scalable Gaussian process methods in speed and accuracy. Moreover, we discover profound differences between each of these methods, suggesting expressive kernels, nonparametric representations, and scalable inference which exploits model structure are useful in combination for modelling large scale multidimensional patterns.}, journal = {arXiv preprint arXiv:1310.5288}, year = 2013, url = {http://arxiv.org/pdf/1310.5288v3} } @techreport{WilKnoGha11, cat = {np gp time deep}, author = {Andrew Gordon Wilson and David A Knowles and Zoubin Ghahramani}, title = {Gaussian Process Regression Networks}, number = {arXiv:1110.4411 [stat.ML]}, institution = {Department of Engineering, University of Cambridge}, address = {Cambridge, UK}, abstract = {We introduce a new regression framework, Gaussian process regression networks (GPRN), which combines the structural properties of Bayesian neural networks with the non-parametric flexibility of Gaussian processes. This model accommodates input dependent signal and noise correlations between multiple response variables, input dependent length-scales and amplitudes, and heavy-tailed predictive distributions. We derive both efficient Markov chain Monte Carlo and variational Bayes inference procedures for this model. We apply GPRN as a multiple output regression and multivariate volatility model, demonstrating substantially improved performance over eight popular multiple output (multi-task) Gaussian process models and three multivariate volatility models on benchmark datasets, including a 1000 dimensional gene expression dataset. }, month = {October 19}, year = 2011, url = {.}, annote = {arXiv:1110.4411} } @inproceedings{WilKnoGha12, cat = {gp}, author = {Andrew Gordon Wilson and David A. Knowles and Zoubin Ghahramani}, title = {Gaussian Process Regression Networks}, booktitle = icml29, year = 2012, month = {June}, address = {Edinburgh, Scotland}, url = {http://icml.cc/2012/papers/329.pdf}, abstract = {We introduce a new regression framework, Gaussian process regression networks (GPRN), which combines the structural properties of Bayesian neural networks with the nonparametric flexibility of Gaussian processes. GPRN accommodates input (predictor) dependent signal and noise correlations between multiple output (response) variables, input dependent length-scales and amplitudes, and heavy-tailed predictive distributions. We derive both elliptical slice sampling and variational Bayes inference procedures for GPRN. We apply GPRN as a multiple output regression and multivariate volatility model, demonstrating substantially improved performance over eight popular multiple output (multi-task) Gaussian process models and three multivariate volatility models on real datasets, including a 1000 dimensional gene expression dataset.} } @inproceedings{WilOrbGha10, cat = {np}, author = {Sinead Williamson and Peter Orbanz and Zoubin Ghahramani}, title = {Dependent {I}ndian buffet processes}, booktitle = aistats13, year = 2010, volume = 9, series = {W \& CP}, address = {Chia Laguna, Sardinia, Italy}, month = {May}, pages = {924--931}, url = {.}, abstract = {Latent variable models represent hidden structure in observational data. To account for the distribution of the observational data changing over time, space or some other covariate, we need generalizations of latent variable models that explicitly capture this dependency on the covariate. A variety of such generalizations has been proposed for latent variable models based on the Dirichlet process. We address dependency on covariates in binary latent feature models, by introducing a dependent Indian Buffet Process. The model generates a binary random matrix with an unbounded number of columns for each value of the covariate. Evolution of the binary matrices over the covariate set is controlled by a hierarchical Gaussian process model. The choice of covariance functions controls the dependence structure and exchangeability properties of the model. We derive a Markov Chain Monte Carlo sampling algorithm for Bayesian inference, and provide experiments on both synthetic and real-world data. The experimental results show that explicit modeling of dependencies significantly improves accuracy of predictions.} } @inproceedings{WilRas96, cat = {gp}, author = {Chris K.~I.~Williams and Carl Edward Rasmussen}, title = {Gaussian processes for regression}, booktitle = nips8, editors = {D.~S.~Touretzky and M.~C.~Mozer and M.~E.~Hasselmo}, pages = {514--520}, publisher = mit, address = {Cambridge, MA., USA}, abstract = {The Bayesian analysis of neural networks is difficult because a simple prior over weights implies a complex prior over functions. We investigate the use of a Gaussian process prior over functions, which permits the predictive Bayesian analysis for fixed values of hyperparameters to be carried out exactly using matrix operations. Two methods, using optimization and averaging (via Hybrid Monte Carlo) over hyperparameters have been tested on a number of challenging problems and have produced excellent results.}, year = 1996, url = {.} } @article{WilRasHen17, cat = {gp}, author = {Mark van der Wilk and Carl Edward Rasmussen and James Hensman}, title = {Convolutional {G}aussian Processes}, abstract = {We present a practical way of introducing convolutional structure into Gaussian processes, making them more suited to high-dimensional inputs like images. The main contribution of our work is the construction of an inter-domain inducing point approximation that is well-tailored to the convolutional kernel. This allows us to gain the generalisation benefit of a convolutional kernel, together with fast but accurate posterior inference. We investigate several variations of the convolutional kernel, and apply it to MNIST and CIFAR-10, which have both been known to be challenging for Gaussian processes. We also show how the marginal likelihood can be used to find an optimal weighting between convolutional and RBF kernels to further improve performance. We hope that this illustration of the usefulness of a marginal likelihood will help automate discovering architectures in larger models.}, year = 2017, url = {https://arxiv.org/abs/1709.01894}, journal = {arXiv preprint arXiv:1709.01894} } @techreport{WilRasSchTre02, cat = {gp}, author = {Christopher K.~I.~Williams and Carl Edward Rasmussen and Anton Schwaighofer and Volker Tresp}, title = {Observations on the {N}ystr{\"o}m Method for {G}aussian Process Prediction}, url = {.}, year = 2002, institution = {University of Edinburgh}, abstract = {A number of methods for speeding up Gaussian Process (GP) prediction have been proposed, including the Nystr{\"o}m method of Williams and Seeger (2001). In this paper we focus on two issues (1) the relationship of the Nystr{\"o}m method to the Subset of Regressors method (Poggio and Girosi 1990; Luo and Wahba, 1997) and (2) understanding in what circumstances the Nystr{\"o}m approximation would be expected to provide a good approximation to exact GP regression.} } @inproceedings{WilWanHelBle10, cat = {np}, author = {Sinead Williamson and Katherine A. Heller and C. Wang and D. M. Blei}, title = {The {IBP} compound {D}irichlet process and its application to focused topic modeling}, booktitle = icml27, pages = {1151--1158}, year = 2010, month = {June}, address = {Haifa, Israel}, url = {.}, abstract = {The hierarchical Dirichlet process (HDP) is a Bayesian nonparametric mixed membership model --- each data point is modeled with a collection of components of different proportions. Though powerful, the HDP makes an assumption that the probability of a component being exhibited by a data point is positively correlated with its proportion within that data point. This might be an undesirable assumption. For example, in topic modeling, a topic (component) might be rare throughout the corpus but dominant within those documents (data points) where it occurs. We develop the IBP compound Dirichlet process (ICD), a Bayesian nonparametric prior that decouples across-data prevalence and within-data proportion in a mixed membership model. The ICD combines properties from the HDP and the Indian buffet process (IBP), a Bayesian nonparametric prior on binary matrices. The ICD assigns a subset of the shared mixture components to each data point. This subset, the data point's "focus", is determined independently from the amount that each of its components contribute. We develop an ICD mixture model for text, the focused topic model (FTM), and show superior performance over the HDP-based topic model.} } @article{WilYuHoletal14, cat = {gp sigproc time}, author = {Andrew Gordon Wilson and Yuting Wu and Daniel J. Holland and Sebastian Nowozin and Mick D. Mantle and Lynn F. Gladden and Andrew Blake}, title = {Bayesian Inference for {NMR} Spectroscopy with Applications to Chemical Quantification}, abstract = {Nuclear magnetic resonance (NMR) spectroscopy exploits the magnetic properties of atomic nuclei to discover the structure, reaction state and chemical environment of molecules. We propose a probabilistic generative model and inference procedures for NMR spectroscopy. Specifically, we use a weighted sum of trigonometric functions undergoing exponential decay to model free induction decay (FID) signals. We discuss the challenges in estimating the components of this general model -- amplitudes, phase shifts, frequencies, decay rates, and noise variances -- and offer practical solutions. We compare with conventional Fourier transform spectroscopy for estimating the relative concentrations of chemicals in a mixture, using synthetic and experimentally acquired FID signals. We find the proposed model is particularly robust to low signal to noise ratios (SNR), and overlapping peaks in the Fourier transform of the FID, enabling accurate predictions (e.g., 1\% error at low SNR) which are not possible with conventional spectroscopy (5\% error).}, year = 2014, journal = {arXiv preprint arXiv 1402.3580}, url = {http://arxiv.org/pdf/1402.3580v2} } @inproceedings{WolGhaJor94a, author = {Wolpert, Daniel M. and Ghahramani, Zoubin and Jordan, Michael I.}, booktitle = nips7, cat = {neuro}, editor = {Tesauro, Gerald and Touretzky, David S. and Leen, Todd K.}, pages = {43-50}, publisher = {MIT Press}, title = {Forward dynamic models in human motor control: Psychophysical evidence}, url = {.}, year = 1994, abstract = {Based on computational principles, with as yet no direct experimental validation, it has been proposed that the central nervous system (CNS) uses an internal model to simulate the dynamic behavior of the motor system in planning, control and learning. We present experimental results and simulations based on a novel approach that investigates the temporal propagation of errors in the sensorimotor integration process. Our results provide direct support for the existence of an internal model.} } @inproceedings{WooGriGha06a, author = {Wood, Frank and Griffiths, Thomas L. and Ghahramani, Zoubin}, booktitle = {UAI}, cat = {np}, isbn = {0-9749039-2-2}, publisher = {AUAI Press}, title = {A Non-Parametric {B}ayesian Method for Inferring Hidden Causes}, url = {.}, year = 2006, abstract = {We present a non-parametric Bayesian approach to structure learning with hidden causes. Previous Bayesian treatments of this problem define a prior over the number of hidden causes and use algorithms such as reversible jump Markov chain Monte Carlo to move between solutions. In contrast, we assume that the number of hidden causes is unbounded, but only a finite number influence observable variables. This makes it possible to use a Gibbs sampler to approximate the distribution over causal structures. We evaluate the performance of both approaches in discovering hidden causes in simulated data, and use our non-parametric approach to discover hidden causes in a real medical dataset.} } @inproceedings{WuHerGha13, cat = {mcmc time}, title = {Dynamic Covariance Models for Multivariate Financial Time Series}, author = {Yue Wu and Jos\'e Miguel Hern\'andez-Lobato and Zoubin Ghahramani}, booktitle = icml30, year = 2013, month = {June}, address = {Atlanta, Georgia, USA}, url = {http://jmlr.org/proceedings/papers/v28/wu13.pdf}, abstract = {The accurate prediction of time-changing covariances is an important problem in the modeling of multivariate financial data. However, some of the most popular models suffer from a) overfitting problems and multiple local optima, b) failure to capture shifts in market conditions and c) large computational costs. To address these problems we introduce a novel dynamic model for time-changing covariances. Over-fitting and local optima are avoided by following a Bayesian approach instead of computing point estimates. Changes in market conditions are captured by assuming a diffusion process in parameter values, and finally computationally efficient and scalable inference is performed using particle filters. Experiments with financial data show excellent performance of the proposed method with respect to current standard models.} } @inproceedings{XuHelGha09, cat = {np approx}, volume = 5, author = {Yang Xu and Katherine A. Heller and Zoubin Ghahramani}, note = {ISSN 1938-7228}, booktitle = aistats12, editor = {D.~van Dyk and M.~Welling}, title = {Tree-based inference for {D}irichlet process mixtures}, publisher = {Microtome Publishing (paper), Journal of Machine Learning Research (online)}, year = 2009, month = {April}, address = {Clearwater Beach, FL, USA}, pages = {623--630}, url = {.}, abstract = {The Dirichlet process mixture (DPM) is a widely used model for clustering and for general nonparametric Bayesian density estimation. Unfortunately, like in many statistical models, exact inference in a DPM is intractable, and approximate methods are needed to perform efficient inference. While most attention in the literature has been placed on Markov chain Monte Carlo (MCMC) [1, 2, 3], variational Bayesian (VB) [4] and collapsed variational methods [5], [6] recently introduced a novel class of approximation for DPMs based on Bayesian hierarchical clustering (BHC). These tree-based combinatorial approximations efficiently sum over exponentially many ways of partitioning the data and offer a novel lower bound on the marginal likelihood of the DPM [6]. In this paper we make the following contributions: (1) We show empirically that the BHC lower bounds are substantially tighter than the bounds given by VB [4] and by collapsed variational methods [5] on synthetic and real datasets. (2) We also show that BHC offers a more accurate predictive performance on these datasets. (3) We further improve the tree-based lower bounds with an algorithm that efficiently sums contributions from alternative trees. (4) We present a fast approximate method for BHC. Our results suggest that our combinatorial approximate inference methods and lower bounds may be useful not only in DPMs but in other models as well.} } @article{YuCunSanetal09a, cat = {gp}, author = {B. M. Yu and J. P. Cunningham and G. Santhanam and S. I. Ryu and K. V. Shenoy and M. Sahani}, title = {{G}aussian-process factor analysis for low-dimensional single-trial analysis of neural population activity}, url = {http://mlg.eng.cam.ac.uk/john/pubs/pdf/YuJNP2009.pdf}, journal = {Journal of Neurophysiology}, volume = 102, pages = {614-635}, year = 2009, abstract = {We consider the problem of extracting smooth, low-dimensional neural trajectories that summarize the activity recorded simultaneously from many neurons on individual experimental trials. Beyond the benefit of visualizing the high-dimensional, noisy spiking activity in a compact form, such trajectories can offer insight into the dynamics of the neural circuitry underlying the recorded activity. Current methods for extracting neural trajectories involve a two-stage process: the spike trains are first smoothed over time, then a static dimensionality- reduction technique is applied. We first describe extensions of the two-stage methods that allow the degree of smoothing to be chosen in a principled way and that account for spiking variability, which may vary both across neurons and across time. We then present a novel method for extracting neural trajectories -- Gaussian-process factor analysis (GPFA) -- which unifies the smoothing and dimensionality- reduction operations in a common probabilistic framework. We applied these methods to the activity of 61 neurons recorded simultaneously in macaque premotor and motor cortices during reach planning and execution. By adopting a goodness-of-fit metric that measures how well the activity of each neuron can be predicted by all other recorded neurons, we found that the proposed extensions improved the predictive ability of the two-stage methods. The predictive ability was further improved by going to GPFA. From the extracted trajectories, we directly observed a convergence in neural state during motor planning, an effect that was shown indirectly by previous studies. We then show how such methods can be a powerful tool for relating the spiking activity across a neural population to the subject's behavior on a single-trial basis. Finally, to assess how well the proposed methods characterize neural population activity when the underlying time course is known, we performed simulations that revealed that GPFA performed tens of percent better than the best two-stage method.} } @inproceedings{YuCunSanetal09b, cat = {gp}, booktitle = nips21, author = {B. M. Yu and J. P. Cunningham and G. Santhanam and S. I. Ryu and K. V. Shenoy and M. Sahani}, title = {{G}aussian-process factor analysis for low-dimensional single-trial analysis of neural population activity}, url = {http://mlg.eng.cam.ac.uk/john/pubs/pdf/YuNIPS2009.pdf}, year = 2009, address = {Vancouver, BC}, month = {December}, pages = {1--8}, abstract = {We consider the problem of extracting smooth, low-dimensional neural trajectories that summarize the activity recorded simultaneously from many neurons on individual experimental trials. Beyond the benefit of visualizing the high-dimensional, noisy spiking activity in a compact form, such trajectories can offer insight into the dynamics of the neural circuitry underlying the recorded activity. Current methods for extracting neural trajectories involve a two-stage process: the spike trains are first smoothed over time, then a static dimensionality- reduction technique is applied. We first describe extensions of the two-stage methods that allow the degree of smoothing to be chosen in a principled way and that account for spiking variability, which may vary both across neurons and across time. We then present a novel method for extracting neural trajectories -- Gaussian-process factor analysis (GPFA) -- which unifies the smoothing and dimensionality- reduction operations in a common probabilistic framework. We applied these methods to the activity of 61 neurons recorded simultaneously in macaque premotor and motor cortices during reach planning and execution. By adopting a goodness-of-fit metric that measures how well the activity of each neuron can be predicted by all other recorded neurons, we found that the proposed extensions improved the predictive ability of the two-stage methods. The predictive ability was further improved by going to GPFA. From the extracted trajectories, we directly observed a convergence in neural state during motor planning, an effect that was shown indirectly by previous studies. We then show how such methods can be a powerful tool for relating the spiking activity across a neural population to the subject's behavior on a single-trial basis. Finally, to assess how well the proposed methods characterize neural population activity when the underlying time course is known, we performed simulations that revealed that GPFA performed tens of percent better than the best two-stage method.} } @article{ZhaBatCunetal11, cat = {time}, author = {M. Zhao and A. P. Batista and J. P. Cunningham and C. A. Chestek and Z. Rivera-Alvidrez and R. Kalmar and S. I. Ryu and K. V. Shenoy and S. Iyengar}, title = {An {L}1-regularized logistic model for detecting short-term neuronal interactions.}, url = {http://mlg.eng.cam.ac.uk/john/pubs/pdf/ZhaoJCNS2011.pdf}, doi = {10.1007/s10827-011-0365-5}, journal = {Journal of Computational Neuroscience}, note = {In Press.}, year = 2011, abstract = {Interactions among neurons are a key com- ponent of neural signal processing. Rich neural data sets potentially containing evidence of interactions can now be collected readily in the laboratory, but existing analysis methods are often not sufficiently sensitive and specific to reveal these interactions. Generalized linear models offer a platform for analyzing multi-electrode recordings of neuronal spike train data. Here we suggest an L1-regularized logistic regression model (L1L method) to detect short-term (order of 3ms) neuronal interactions. We estimate the parameters in this model using a coordinate descent algorithm, and determine the optimal tuning parameter using a Bayesian Information Criterion. Simulation studies show that in general the L1L method has better sensitivities and specificities than those of the traditional shuffle-corrected cross-correlogram (covariogram) method. The L1L method is able to detect excitatory interactions with both high sensitivity and specificity with reasonably large recordings, even when the magnitude of the interactions is small; similar results hold for inhibition given sufficiently high baseline firing rates. Our study also suggests that the false positives can be further removed by thresholding, because their magnitudes are typically smaller than true interactions. Simulations also show that the L1L method is somewhat robust to partially observed networks. We apply the method to multi-electrode recordings collected in the monkey dorsal premotor cortex (PMd) while the animal prepares to make reaching arm movements. The results show that some neurons interact differently depending on task conditions. The stronger interactions detected with our L1L method were also visible using the covariogram method.} } @inproceedings{ZhaGhaYan04a, author = {Zhang, Jian and Ghahramani, Zoubin and Yang, Yiming}, booktitle = {NIPS}, cat = {nlp}, editor = {Thrun, Sebastian and Saul, Lawrence K. and Sch{\"o}lkopf, Bernhard}, isbn = {0-262-20152-6}, publisher = {MIT Press}, title = {A Probabilistic Model for Online Document Clustering with Application to Novelty Detection}, url = {.}, year = 2004, abstract = {In this paper we propose a probabilistic model for online document clustering. We use non-parametric Dirichlet process prior to model the growing number of clusters, and use a prior of general English language model as the base distribution to handle the generation of novel clusters. Furthermore, cluster uncertainty is modeled with a Bayesian Dirichletmultinomial distribution. We use empirical Bayes method to estimate hyperparameters based on a historical dataset. Our probabilistic model is applied to the novelty detection task in Topic Detection and Tracking (TDT) and compared with existing approaches in the literature.} } @inproceedings{ZhaGhaYan05a, author = {Zhang, Jian and Ghahramani, Zoubin and Yang, Yiming}, booktitle = {NIPS}, cat = {nlp}, title = {Learning Multiple Related Tasks using Latent Independent Component Analysis}, url = {.}, year = 2005, abstract = {We propose a probabilistic model based on Independent Component Analysis for learning multiple related tasks. In our model the task parameters are assumed to be generated from independent sources which account for the relatedness of the tasks. We use Laplace distributions to model hidden sources which makes it possible to identify the hidden, independent components instead of just modeling correlations. Furthermore, our model enjoys a sparsity property which makes it both parsimonious and robust. We also propose efficient algorithms for both empirical Bayes method and point estimation. Our experimental results on two multi-label text classification data sets show that the proposed approach is promising. } } @article{ZhaGhaYan08, cat = {gm}, volume = 73, number = 3, month = {December}, author = {J. Zhang and Z. Ghahramani and Y. Yang}, title = {Flexible latent variable models for multi-task learning}, publisher = {Springer Netherlands}, year = 2008, journal = {Machine Learning}, pages = {221--242}, url = {.}, abstract = {Given multiple prediction problems such as regression and classification, we are interested in a joint inference framework which can effectively borrow information among tasks to improve the prediction accuracy, especially when the number of training examples per problem is small. In this paper we propose a probabilistic framework which can support a set of latent variable models for different multi-task learning scenarios. We show that the framework is a generalization of standard learning methods for single prediction problems and it can effectively model the shared structure among different prediction tasks. Furthermore, we present efficient algorithms for the empirical Bayes method as well as point estimation. Our experiments on both simulated datasets and real world classification datasets show the effectiveness of the proposed models in two evaluation settings: standard multi-task learning setting and transfer learning setting.} } @inproceedings{ZhaSutSto12a, author = {Zhang, Yichuan and Sutton, Charles A. and Storkey, Amos J. and Ghahramani, Zoubin}, booktitle = {NIPS}, cat = {approx, mcmc}, editor = {Bartlett, Peter L. and Pereira, Fernando C. N. and Burges, Christopher J. C. and Bottou, L{\'e}on and Weinberger, Kilian Q.}, pages = {3203-3211}, title = {Continuous Relaxations for Discrete {H}amiltonian {M}onte {C}arlo}, url = {.}, year = 2012, abstract = {Continuous relaxations play an important role in discrete optimization, but have not seen much use in approximate probabilistic inference. Here we show that a general form of the Gaussian Integral Trick makes it possible to transform a wide class of discrete variable undirected models into fully continuous systems. The continuous representation allows the use of gradient-based Hamiltonian Monte Carlo for inference, results in new ways of estimating normalization constants (partition functions), and in general opens up a number of new avenues for inference in difficult discrete systems. We demonstrate some of these continuous relaxation inference algorithms on a number of illustrative problems.} } @inproceedings{ZhuGhaLaf03a, author = {Zhu, Xiaojin and Ghahramani, Zoubin and Lafferty, John D.}, booktitle = {ICML}, cat = {ssl}, editor = {Fawcett, Tom and Mishra, Nina}, isbn = {1-57735-189-4}, pages = {912-919}, publisher = {AAAI Press}, title = {Semi-Supervised Learning Using {G}aussian Fields and Harmonic Functions}, url = {.}, year = 2003, abstract = {An approach to semi-supervised learning is proposed that is based on a Gaussian random field model. Labeled and unlabeled data are represented as vertices in a weighted graph, with edge weights encoding the similarity between instances. The learning problem is then formulated in terms of a Gaussian random field on this graph, where the mean of the field is characterized in terms of harmonic functions, and is efficiently obtained using matrix methods or belief propagation. The resulting learning algorithms have intimate connections with random walks, electric networks, and spectral graph theory. We discuss methods to incorporate class priors and the predictions of classifiers obtained by supervised learning. We also propose a method of parameter learning by entropy minimization, and show the algorithm's ability to perform feature selection. Promising experimental results are presented for synthetic data, digit classification, and text classification tasks.} } @inproceedings{ZhuKanGha04a, author = {Zhu, Xiaojin and Kandola, Jaz S. and Ghahramani, Zoubin and Lafferty, John D.}, booktitle = {NIPS}, cat = {ssl}, editor = {Thrun, Sebastian and Saul, Lawrence K. and Sch{\"o}lkopf, Bernhard}, isbn = {0-262-20152-6}, publisher = {MIT Press}, title = {Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning}, url = {.}, year = 2004, abstract = {We present an algorithm based on convex optimization for constructing kernels for semi-supervised learning. The kernel matrices are derived from the spectral decomposition of graph Laplacians, and combine labeled and unlabeled data in a systematic fashion. Unlike previous work using diffusion kernels and Gaussian random field kernels, a nonparametric kernel approach is presented that incorporates order constraints during optimization. This results in flexible kernels and avoids the need to choose among different parametric forms. Our approach relies on a quadratically constrained quadratic program (QCQP), and is computationally feasible for large datasets. We evaluate the kernels on real datasets using support vector machines, with encouraging results. } }