## Publications by Year

## 2022

#### Sparse MoEs meet Efficient Ensembles

James Urquhart Allingham, Florian Wenzel, Zelda E Mariet, Basil Mustafa, Joan Puigcerver, Neil Houlsby, Ghassen Jerfel, Vincent Fortuin, Balaji Lakshminarayanan, Jasper Snoek, Dustin Tran, Carlos Riquelme Ruiz, Rodolphe Jenatton, 2022. (Transactions on Machine Learning Research).

Machine learning models based on the aggregated outputs of submodels, either at the activation or prediction levels, often exhibit strong performance compared to individual models. We study the interplay of two popular classes of such models: ensembles of neural networks and sparse mixture of experts (sparse MoEs). First, we show that the two approaches have complementary features whose combination is beneficial. This includes a comprehensive evaluation of sparse MoEs in uncertainty related benchmarks. Then, we present efficient ensemble of experts (E3), a scalable and simple ensemble of sparse MoEs that takes the best of both classes of models, while using up to 45% fewer FLOPs than a deep ensemble. Extensive experiments demonstrate the accuracy, log-likelihood, few-shot learning, robustness, and uncertainty improvements of E3 over several challenging vision Transformer-based baselines. E3 not only preserves its efficiency while scaling to models with up to 2.7B parameters, but also provides better predictive performance and uncertainty estimates for larger models.

**Comment:** Code

#### Adapting the Linearised Laplace Model Evidence for Modern Deep Learning

Javier Antorán, David Janz, James Urquhart Allingham, Erik A. Daxberger, Riccardo Barbano, Eric T. Nalisnick, José Miguel Hernández-Lobato, 2022. (In 39th International Conference on Machine Learning). Edited by Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu, Sivan Sabato. PMLR. Proceedings of Machine Learning Research.

The linearised Laplace method for estimating model uncertainty has received renewed attention in the Bayesian deep learning community. The method provides reliable error bars and admits a closed-form expression for the model evidence, allowing for scalable selection of model hyperparameters. In this work, we examine the assumptions behind this method, particularly in conjunction with model selection. We show that these interact poorly with some now-standard tools of deep learning (stochastic approximation methods and normalisation layers) and make recommendations for how to better adapt this classic method to the modern setting. We provide theoretical support for our recommendations and validate them empirically on MLPs, classic CNNs, residual networks with and without normalisation layers, generative autoencoders and transformers.

#### Partitioned Variational Inference: A Framework for Probabilistic Federated Learning

Matthew Ashman, Thang D. Bui, Cuong V. Nguyen, Efstratios Markou, Adrian Weller, Siddharth Swaroop, Richard E. Turner, 2022.

The proliferation of computing devices has brought about an opportunity to deploy machine learning models on new problem domains using previously inaccessible data. Traditional algorithms for training such models often require data to be stored on a single machine with compute performed by a single node, making them unsuitable for decentralised training on multiple devices. This deficiency has motivated the development of federated learning algorithms, which allow multiple data owners to train collaboratively and use a shared model whilst keeping local data private. However, many of these algorithms focus on obtaining point estimates of model parameters, rather than probabilistic estimates capable of capturing model uncertainty, which is essential in many applications. Variational inference (VI) has become the method of choice for fitting many modern probabilistic models. In this paper we introduce partitioned variational inference (PVI), a general framework for performing VI in the federated setting. We develop new supporting theory for PVI, demonstrating a number of properties that make it an attractive choice for practitioners; use PVI to unify a wealth of fragmented, yet related literature; and provide empirical results that showcase the effectiveness of PVI in a variety of federated settings.

#### Stationary Kernels and Gaussian Processes on Lie Groups and their Homogeneous Spaces I: the Compact Case

Iskander Azangulov, Andrei Smolensky, Alexander Terenin, Viacheslav Borovitskiy, 2022. (arXiv).

Gaussian processes are arguably the most important model class in spatial statistics. They encode prior information about the modeled function and can be used for exact or approximate Bayesian inference. In many applications, particularly in physical sciences and engineering, but also in areas such as geostatistics and neuroscience, invariance to symmetries is one of the most fundamental forms of prior information one can consider. The invariance of a Gaussian process’ covariance to such symmetries gives rise to the most natural generalization of the concept of stationarity to such spaces. In this work, we develop constructive and practical techniques for building stationary Gaussian processes on a very large class of non-Euclidean spaces arising in the context of symmetries. Our techniques make it possible to (i) calculate covariance kernels and (ii) sample from prior and posterior Gaussian processes defined on such spaces, both in a practical manner. This work is split into two parts, each involving different technical considerations: part I studies compact spaces, while part II studies non-compact spaces possessing certain structure. Our contributions make the non-Euclidean Gaussian process models we study compatible with well-understood computational techniques available in standard Gaussian process software packages, thereby making them accessible to practitioners.

#### On the Utility of Prediction Sets in Human-AI Teams

Varun Babbar, Umang Bhatt, Adrian Weller, 2022. (In International Joint Conference on Artificial Intelligence).

Research on human-AI teams usually provides experts with a single label, which ignores the uncertainty in a model’s recommendation. Conformal prediction (CP) is a well-established line of research that focuses on building a theoretically grounded, calibrated prediction set, which may contain multiple labels. We explore how such prediction sets impact expert decision-making in human-AI teams. Our evaluation on human subjects finds that set-valued predictions positively impact experts. However, we notice that the predictive sets provided by CP can be very large, which leads to unhelpful AI assistants. To mitigate this, we introduce D-CP, a method to perform CP on some examples and defer to experts. We prove that D-CP can reduce the prediction set size of non-deferred examples. We show how D-CP performs in quantitative and in human subject experiments (n=120). Our results suggest that CP prediction sets improve human-AI team performance over showing the top-1 prediction alone, and that experts find D-CP prediction sets are more useful than CP prediction sets.
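
To make the notion of a calibrated prediction set concrete, here is a minimal sketch of split conformal prediction for classification. It is an illustration only: the abstract does not say which CP variant the paper uses, the deferral step of D-CP is not shown, and the function names and the score `1 - p(true label)` are assumptions chosen for this example.

```python
import math

import numpy as np


def conformal_threshold(cal_probs, cal_labels, alpha):
    """Score threshold from a held-out calibration set (split conformal).

    Assumed nonconformity score: 1 - predicted probability of the true label.
    """
    cal_probs = np.asarray(cal_probs, dtype=float)
    cal_labels = np.asarray(cal_labels)
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample correction: take the ceil((n + 1)(1 - alpha))-th smallest score.
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        return math.inf  # too few calibration points: every label is included
    return float(np.sort(scores)[k - 1])


def prediction_set(probs, threshold):
    """Indices of all labels whose score does not exceed the threshold."""
    probs = np.asarray(probs, dtype=float)
    return np.flatnonzero(1.0 - probs <= threshold)


# Hypothetical calibration data: 7 examples, 3 classes, true label always 0.
true_probs = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3])
cal_probs = np.column_stack([true_probs, (1 - true_probs) / 2, (1 - true_probs) / 2])
cal_labels = np.zeros(7, dtype=int)

threshold = conformal_threshold(cal_probs, cal_labels, alpha=0.25)
uncertain_set = prediction_set([0.50, 0.45, 0.05], threshold)
confident_set = prediction_set([0.97, 0.02, 0.01], threshold)
```

An ambiguous input yields a set with several labels while a confident one yields a single label, which is the behaviour the abstract contrasts with showing only the top-1 prediction.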

#### Modelling Non-Smooth Signals with Complex Spectral Structure

Wessel P. Bruinsma, Martin Tegnér, Richard E. Turner, 2022. (In 25th International Conference on Artificial Intelligence and Statistics).

The Gaussian Process Convolution Model (GPCM; Tobar et al., 2015a) is a model for signals with complex spectral structure. A significant limitation of the GPCM is that it assumes a rapidly decaying spectrum: it can only model smooth signals. Moreover, inference in the GPCM currently requires (1) a mean-field assumption, resulting in poorly calibrated uncertainties, and (2) a tedious variational optimisation of large covariance matrices. We redesign the GPCM model to induce a richer distribution over the spectrum with relaxed assumptions about smoothness: the Causal Gaussian Process Convolution Model (CGPCM) introduces a causality assumption into the GPCM, and the Rough Gaussian Process Convolution Model (RGPCM) can be interpreted as a Bayesian nonparametric generalisation of the fractional Ornstein–Uhlenbeck process. We also propose a more effective variational inference scheme, going beyond the mean-field assumption: we design a Gibbs sampler which directly samples from the optimal variational solution, circumventing any variational optimisation entirely. The proposed variations of the GPCM are validated in experiments on synthetic and real-world data, showing promising results.

#### Scalable Approximate Inference and Model Selection in Gaussian Process Regression

David R. Burt, 2022. University of Cambridge, Department of Engineering, Cambridge, UK.

Models with Gaussian process priors and Gaussian likelihoods are one of only a handful of Bayesian models where inference can be performed without the need for approximation. However, a frequent criticism of these models from practitioners of Bayesian machine learning is that they are challenging to scale to large datasets due to the need to compute a large kernel matrix and perform standard linear-algebraic operations with this matrix. This limitation has driven decades of research in both statistics and machine learning seeking to scale Gaussian process regression models to ever-larger datasets. This thesis builds on this line of research. We focus on the problem of approximate inference and model selection with approximate maximum marginal likelihood as applied to Gaussian process regression. Our discussion is guided by three questions: Does an approximation work on a range of models and datasets? Can you verify that an approximation has worked on a given dataset? Is an approximation easy for a practitioner to use? While we are far from the first to ask these questions, we offer new insights into each question in the context of Gaussian process regression. In the first part of this thesis, we focus on sparse variational Gaussian process regression (Titsias, 2009). We provide new diagnostics for inference with this method that can be used as practical guides for practitioners trying to balance computation and accuracy with this approximation. We then provide an asymptotic analysis that highlights properties of the model and dataset that are sufficient for this approximation to perform reliable inference with a small computational cost. This analysis builds on an approach laid out in Burt (2018), as well as on similar guarantees in the kernel ridge regression literature. In the second part of this thesis, we consider iterative methods, especially the method of conjugate gradients, as applied to Gaussian process regression (Gibbs and MacKay, 1997). 
We primarily focus on improving the reliability of approximate maximum marginal likelihood when using these approximations. We investigate how the method of conjugate gradients and related approaches can be used to derive bounds on quantities related to the log marginal likelihood. This idea can be used to improve the speed and stability of model selection with these approaches, making them easier to use in practice.
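
As a rough illustration of the iterative approach mentioned in the abstract above, the sketch below solves the linear system `(K + noise * I) v = y` that underlies the Gaussian process regression posterior mean using a textbook conjugate gradient loop. It is not the thesis's method: the bounds on the log marginal likelihood are not shown, and the kernel, data, noise level and function names are all assumptions made for the example.

```python
import numpy as np


def rbf_kernel(x1, x2, lengthscale=1.0):
    """Squared-exponential kernel on 1-D inputs (an assumed choice)."""
    diff = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (diff / lengthscale) ** 2)


def conjugate_gradients(matvec, b, tol=1e-8, max_iter=1000):
    """Solve A x = b for symmetric positive definite A, given only x -> A @ x."""
    x = np.zeros_like(b)
    r = b.copy()           # residual b - A x for the initial guess x = 0
    p = r.copy()           # current search direction
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) < tol:
            break
        Ap = matvec(p)
        step = rs / (p @ Ap)
        x += step * p
        r -= step * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x


# Hypothetical 1-D regression data.
x_train = np.linspace(0.0, 5.0, 50)
y_train = np.sin(x_train)
noise = 0.1

K = rbf_kernel(x_train, x_train)
v = conjugate_gradients(lambda u: K @ u + noise * u, y_train)

# Posterior mean at test inputs: K(x_test, x_train) @ v.
x_test = np.array([1.0, 2.5])
posterior_mean = rbf_kernel(x_test, x_train) @ v
```

The solver only needs matrix-vector products with the kernel matrix, which is why iterative methods of this kind are attractive when a direct factorisation is too expensive.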

#### Racial Disparities in the Enforcement of Marijuana Violations in the US

Bradley Butcher, Chris Robinson, Miri Zilka, Riccardo Fogliato, Carolyn Ashurst, Adrian Weller, 2022. (Proceedings of the 2022 AAAI/ACM Conference on AI, Ethics, and Society).

Racial disparities in US drug arrest rates have been observed for decades, but their causes and policy implications are still contested. Some have argued that the disparities largely reflect differences in drug use between racial groups, while others have hypothesized that discriminatory enforcement policies and police practices play a significant role. In this work, we analyze racial disparities in the enforcement of marijuana violations in the US. Using data from the National Incident-Based Reporting System (NIBRS) and the National Survey on Drug Use and Health (NSDUH) programs, we investigate whether marijuana usage and purchasing behaviors can explain the racial composition of offenders in police records. We examine potential driving mechanisms behind these disparities and the extent to which county-level socioeconomic factors are associated with corresponding disparities. Our results indicate that the significant racial disparities in reported incidents and arrests cannot be explained by differences in marijuana days-of-use alone. Variations in the location where marijuana is purchased and in the frequency of these purchases partially explain the observed disparities. We observe an increase in racial disparities across most counties over the last decade, with the greatest increases in states that legalized the use of marijuana within this timeframe. Income, high school graduation rate, and rate of employment positively correlate with larger racial disparities, while the rate of incarceration is negatively correlated. We conclude with a discussion of the implications of the observed racial disparities in the context of algorithmic fairness.

#### Evaluating Model-Based Planning and Planner Amortization for Continuous Control

Arunkumar Byravan, Leonard Hasenclever, Piotr Trochim, Mehdi Mirza, Alessandro Davide Ialongo, Yuval Tassa, Jost Tobias Springenberg, Abbas Abdolmaleki, Nicolas Heess, Josh Merel, Martin Riedmiller, 2022. (In 10th International Conference on Learning Representations).

There is a widespread intuition that model-based control methods should be able to surpass the data efficiency of model-free approaches. In this paper we attempt to evaluate this intuition on various challenging locomotion tasks. We take a hybrid approach, combining model predictive control (MPC) with a learned model and model-free policy learning; the learned policy serves as a proposal for MPC. We show that MPC with learned proposals and models (trained on the fly or transferred from related tasks) can significantly improve performance and data efficiency with respect to model-free methods. However, we find that well-tuned model-free agents are strong baselines even for high DoF control problems. Finally, we show that it is possible to distil a model-based planner into a policy that amortizes the planning computation without any loss of performance.

#### Optimal Client Sampling for Federated Learning

Wenlin Chen, Samuel Horváth, Peter Richtárik, August 2022. (Transactions on Machine Learning Research).

It is well understood that client-master communication can be a primary bottleneck in federated learning (FL). In this work, we address this issue with a novel client subsampling scheme, where we restrict the number of clients allowed to communicate their updates back to the master node. In each communication round, all participating clients compute their updates, but only the ones with important updates communicate back to the master. We show that importance can be measured using only the norm of the update and give a formula for optimal client participation. This formula minimizes the distance between the full update, where all clients participate, and our limited update, where the number of participating clients is restricted. In addition, we provide a simple algorithm that approximates the optimal formula for client participation, which allows for secure aggregation and stateless clients, and thus does not compromise client privacy. We show both theoretically and empirically that for Distributed SGD (DSGD) and Federated Averaging (FedAvg), the performance of our approach can be close to full participation and superior to the baseline where participating clients are sampled uniformly. Moreover, our approach is orthogonal to and compatible with existing methods for reducing communication overhead, such as local methods and communication compression methods.

**Comment:** arXiv
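
The idea of measuring importance by the norm of the update can be sketched as follows. This is an illustration under stated assumptions, not the paper's exact formula or algorithm: it assigns each client a participation probability proportional to its update norm, caps probabilities at 1, and rescales the rest so that the expected number of communicating clients equals a budget `m`. The function name and the iterative capping procedure are choices made for this example.

```python
import numpy as np


def sampling_probabilities(update_norms, m):
    """Norm-proportional participation probabilities with expected budget m.

    Clients whose proportional share would exceed 1 are fixed at probability 1,
    and the remaining budget is shared among the others in proportion to norm.
    """
    norms = np.asarray(update_norms, dtype=float)
    n = norms.size
    if not 0 < m <= n:
        raise ValueError("m must satisfy 0 < m <= number of clients")
    saturated = np.zeros(n, dtype=bool)
    p = np.zeros(n)
    while True:
        free = ~saturated
        budget = m - saturated.sum()
        total = norms[free].sum()
        if total == 0.0:
            break  # remaining clients have zero-norm updates; leave them at 0
        candidate = budget * norms / total
        newly = free & (candidate > 1.0)
        if not newly.any():
            p[free] = candidate[free]
            break
        saturated |= newly
    p[saturated] = 1.0
    return p


# Hypothetical update norms for five clients; two may communicate on average.
probs = sampling_probabilities([10.0, 1.0, 1.0, 1.0, 1.0], m=2)
```

In importance-sampling schemes of this kind, a sampled client's update is typically divided by its probability before aggregation so that the aggregate stays unbiased in expectation.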

#### Contrasting Discrete and Continuous Methods for Bayesian System Identification

Talay M Cheema, 2022. (In Workshop on Continuous Time Machine Learning at the 39th International Conference on Machine Learning).

In recent years, there has been considerable interest in embedding continuous time methods in machine learning algorithms. In system identification, the task is to learn a dynamical model from incomplete observation data, and when prior knowledge is in continuous time – for example, mechanistic differential equation models – it seems natural to use continuous time models for learning. Yet when learning flexible, nonlinear, probabilistic dynamics models, most previous work has focused on discrete time models to avoid computational, numerical, and mathematical difficulties. In this work we show, with the aid of small-scale examples, that this mismatch between model and data generating process can be consequential under certain circumstances, and we discuss possible modifications to discrete time models which may better suit them to handling data generated by continuous time processes.

#### Meta-learning Adaptive Deep Kernel Gaussian Processes for Molecular Property Prediction

Wenlin Chen, Austin Tripp, José Miguel Hernández-Lobato, 2022. (arXiv).

We propose Adaptive Deep Kernel Fitting with Implicit Function Theorem (ADKF-IFT), a novel framework for learning deep kernel Gaussian processes (GPs) by interpolating between meta-learning and conventional deep kernel learning. Our approach employs a bilevel optimization objective where we meta-learn generally useful feature representations across tasks, in the sense that task-specific GP models estimated on top of such features achieve the lowest possible predictive loss on average. We solve the resulting nested optimization problem using the implicit function theorem (IFT). We show that our ADKF-IFT framework contains previously proposed Deep Kernel Learning (DKL) and Deep Kernel Transfer (DKT) as special cases. Although ADKF-IFT is a completely general method, we argue that it is especially well-suited for drug discovery problems and demonstrate that it significantly outperforms previous state-of-the-art methods on a variety of real-world few-shot molecular property prediction tasks and out-of-domain molecular property prediction and optimization tasks.

#### Scalable One-Pass Optimisation of High-Dimensional Weight-Update Hyperparameters by Implicit Differentiation

Ross M. Clarke, Elre T. Oldewage, José Miguel Hernández-Lobato, April 2022. (In 10th International Conference on Learning Representations). Virtual.

Machine learning training methods depend plentifully and intricately on hyperparameters, motivating automated strategies for their optimisation. Many existing algorithms restart training for each new hyperparameter choice, at considerable computational cost. Some hypergradient-based one-pass methods exist, but these either cannot be applied to arbitrary optimiser hyperparameters (such as learning rates and momenta) or take several times longer to train than their base models. We extend these existing methods to develop an approximate hypergradient-based hyperparameter optimiser which is applicable to any continuous hyperparameter appearing in a differentiable model weight update, yet requires only one training episode, with no restarts. We also provide a motivating argument for convergence to the true hypergradient, and perform tractable gradient-based optimisation of independent learning rates for each model parameter. Our method performs competitively from varied random hyperparameter initialisations on several UCI datasets and Fashion-MNIST (using a one-layer MLP), Penn Treebank (using an LSTM) and CIFAR-10 (using a ResNet-18), in time only 2-3x greater than vanilla training.

#### Wide Mean-Field Bayesian Neural Networks Ignore the Data

Beau Coker, Wessel P. Bruinsma, David R. Burt, Weiwei Pan, Finale Doshi-Velez, 2022. (In 25th International Conference on Artificial Intelligence and Statistics).

Bayesian neural networks (BNNs) combine the expressive power of deep learning with the advantages of Bayesian formalism. In recent years, the analysis of wide, deep BNNs has provided theoretical insight into their priors and posteriors. However, we have no analogous insight into their posteriors under approximate inference. In this work, we show that mean-field variational inference entirely fails to model the data when the network width is large and the activation function is odd. Specifically, for fully-connected BNNs with odd activation functions and a homoscedastic Gaussian likelihood, we show that the optimal mean-field variational posterior predictive (i.e., function space) distribution converges to the prior predictive distribution as the width tends to infinity. We generalize aspects of this result to other likelihoods. Our theoretical results are suggestive of underfitting behavior previously observed in BNNs. While our convergence bounds are non-asymptotic and constants in our analysis can be computed, they are currently too loose to be applicable in standard training regimes. Finally, we show that the optimal approximate posterior need not tend to the prior if the activation function is not odd, showing that our statements cannot be generalized arbitrarily.

#### Eliciting and Learning with Soft Labels from Every Annotator

Katherine M. Collins, Umang Bhatt, Adrian Weller, 2022. (In Proceedings of the AAAI Conference on Human Computation and Crowdsourcing (HCOMP)). **DOI**: 10.17863/CAM.87954.

The labels used to train machine learning (ML) models are of paramount importance. Typically for ML classification tasks, datasets contain hard labels, yet learning using soft labels has been shown to yield benefits for model generalization, robustness, and calibration. Earlier work found success in forming soft labels from multiple annotators’ hard labels; however, this approach may not converge to the best labels and necessitates many annotators, which can be expensive and inefficient. We focus on efficiently eliciting soft labels from individual annotators. We collect and release a dataset of soft labels (which we call CIFAR-10S) over the CIFAR-10 test set via a crowdsourcing study (N=248). We demonstrate that learning with our labels achieves comparable model performance to prior approaches while requiring far fewer annotators – albeit with significant temporal costs per elicitation. Our elicitation methodology therefore shows nuanced promise in enabling practitioners to enjoy the benefits of improved model performance and reliability with fewer annotators, and serves as a guide for future dataset curators on the benefits of leveraging richer information, such as categorical uncertainty, from individual annotators.

**Comment:** [Project Page] [Data] [Code]

#### Can We Automate the Analysis of Online Child Sexual Exploitation Discourse?

Darren Cook, Miri Zilka, Heidi DeSandre, Susan Giles, Adrian Weller, Simon Maskell, 2022. (arXiv).

Social media’s growing popularity raises concerns around children’s online safety. Interactions between minors and adults with predatory intentions are a particularly grave concern. Research into online sexual grooming has often relied on domain experts to manually annotate conversations, limiting both scale and scope. In this work, we test how well automated methods can detect conversational behaviors and replace an expert human annotator. Informed by psychological theories of online grooming, we label 6772 chat messages sent by child-sex offenders with one of eleven predatory behaviors. We train bag-of-words and natural language inference models to classify each behavior, and show that the best performing models classify behaviors in a manner that is consistent, but not on par, with human annotation.

#### Neural Diffusion Processes

Vincent Dutordoir, Alan Saul, Zoubin Ghahramani, Fergus Simpson, Apr 2022. (In arXiv). Online.

Gaussian processes provide an elegant framework for specifying prior and posterior distributions over functions. They are, however, also computationally expensive, and limited by the expressivity of their covariance function. We propose Neural Diffusion Processes (NDPs), a novel approach based upon diffusion models, that learn to sample from distributions over functions. Using a novel attention block, we can incorporate properties of stochastic processes, such as exchangeability, directly into the NDP’s architecture. We empirically show that NDPs are able to capture functional distributions that are close to the true Bayesian posterior of a Gaussian process. This enables a variety of downstream tasks, including hyperparameter marginalisation and Bayesian optimisation.

#### Probable Domain Generalization via Quantile Risk Minimization

C. Eastwood, A. Robey, S. Singh, J. von Kügelgen, H. Hassani, G. J. Pappas, B. Schölkopf, 2022. (In Advances in Neural Information Processing Systems 35). Curran Associates, Inc.

Domain generalization (DG) seeks predictors which perform well on unseen test distributions by leveraging data drawn from multiple related training distributions or domains. To achieve this, DG is commonly formulated as an average- or worst-case problem over the set of possible domains. However, predictors that perform well on average lack robustness while predictors that perform well in the worst case tend to be overly conservative. To address this, we propose a new probabilistic framework for DG where the goal is to learn predictors that perform well with high probability. Our key idea is that distribution shifts seen during training should inform us of probable shifts at test time, which we realize by explicitly relating training and test domains as draws from the same underlying meta-distribution. To achieve probable DG, we propose a new optimization problem called Quantile Risk Minimization (QRM). By minimizing the α-quantile of the predictor’s risk distribution over domains, QRM seeks predictors that perform well with probability α. To solve QRM in practice, we propose the Empirical QRM (EQRM) algorithm, and prove: (i) a generalization bound for EQRM; and (ii) that EQRM recovers the causal predictor as α → 1. In our experiments, we introduce a more holistic quantile-focused evaluation protocol for DG, and demonstrate that EQRM outperforms state-of-the-art baselines on CMNIST and several datasets from WILDS and DomainBed.
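
The contrast between average-case, worst-case and quantile objectives can be seen on a toy example. The sketch below only evaluates the three quantities for a fixed predictor using a plain empirical quantile; it is not the EQRM algorithm from the paper, and the per-domain risks and the function name are made up for illustration.

```python
import numpy as np


def quantile_risk(domain_risks, alpha):
    """Empirical alpha-quantile of per-domain risks (linear interpolation)."""
    return float(np.quantile(np.asarray(domain_risks, dtype=float), alpha))


# Hypothetical risks of one predictor on four training domains.
risks = [0.10, 0.12, 0.15, 0.40]

average_case = float(np.mean(risks))              # average-case objective
worst_case = float(np.max(risks))                 # worst-case objective
probable_case = quantile_risk(risks, alpha=0.9)   # quantile objective
```

Setting `alpha` to 1 recovers the worst-case objective, while smaller values discount the rarest and hardest domains.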

#### Fast Relative Entropy Coding with A* Coding

Gergely Flamich, Stratis Markou, José Miguel Hernández-Lobato, 2022. (In 39th International Conference on Machine Learning).

Relative entropy coding (REC) algorithms encode a sample from a target distribution Q using a proposal distribution P, such that the expected codelength is 𝒪(D_KL[Q||P]). REC can be seamlessly integrated with existing learned compression models since, unlike entropy coding, it does not assume discrete Q or P, and does not require quantisation. However, general REC algorithms require an intractable Ω(e^{D_KL[Q||P]}) runtime. We introduce AS* and AD* coding, two REC algorithms based on A* sampling. We prove that, for continuous distributions over ℝ, if the density ratio is unimodal, AS* has 𝒪(D_∞[Q||P]) expected runtime, where D_∞[Q||P] is the Rényi ∞-divergence. We provide experimental evidence that AD* also has 𝒪(D_∞[Q||P]) expected runtime. We prove that AS* and AD* achieve an expected codelength of 𝒪(D_KL[Q||P]). Further, we introduce DAD*, an approximate algorithm based on AD* which retains its favourable runtime and has bias similar to that of alternative methods. Focusing on VAEs, we propose the IsoKL VAE (IKVAE), which can be used with DAD* to further improve compression efficiency. We evaluate A* coding with (IK)VAEs on MNIST, showing that it can losslessly compress images near the theoretically optimal limit.

#### Deep Classifiers with Label Noise Modeling and Distance Awareness

Vincent Fortuin, Mark Collier, Florian Wenzel, James Urquhart Allingham, Jeremiah Zhe Liu, Dustin Tran, Balaji Lakshminarayanan, Jesse Berent, Rodolphe Jenatton, Effrosyni Kokiopoulou, 2022. (Transactions on Machine Learning Research).

Uncertainty estimation in deep learning has recently emerged as a crucial area of interest to advance reliability and robustness in safety-critical applications. While there have been many proposed methods that either focus on distance-aware model uncertainties for out-of-distribution detection or on input-dependent label uncertainties for in-distribution calibration, both of these types of uncertainty are often necessary. In this work, we propose the HetSNGP method for jointly modeling the model and data uncertainty. We show that our proposed model affords a favorable combination between these two types of uncertainty and thus outperforms the baseline methods on some challenging out-of-distribution datasets, including CIFAR-100C, ImageNet-C, and ImageNet-A. Moreover, we propose HetSNGP Ensemble, an ensembled version of our method which additionally models uncertainty over the network parameters and outperforms other ensemble baselines.

**Comment:** Code

#### Bayesian Neural Network Priors Revisited

Vincent Fortuin, Adrià Garriga-Alonso, Sebastian W. Ober, Florian Wenzel, Gunnar Rätsch, Richard E. Turner, Mark van der Wilk, Laurence Aitchison, 2022. (In 10th International Conference on Learning Representations).

Isotropic Gaussian priors are the de facto standard for modern Bayesian neural network inference. However, it is unclear whether these priors accurately reflect our true beliefs about the weight distributions or give optimal performance. To find better priors, we study summary statistics of neural network weights in networks trained using stochastic gradient descent (SGD). We find that convolutional neural network (CNN) and ResNet weights display strong spatial correlations, while fully connected networks (FCNNs) display heavy-tailed weight distributions. We show that building these observations into priors can lead to improved performance on a variety of image classification datasets. Surprisingly, these priors mitigate the cold posterior effect in FCNNs, but slightly increase the cold posterior effect in ResNets.

#### Learning Deep Neural Networks Through Iterative Linearisation

Adrian Goldwaser, Hong Ge, 2022. (In NeurIPS 2022 Workshop Optimisation in Machine Learning).

The excellent real-world performance of deep neural networks has received increasing attention. Despite the capacity to overfit significantly, such large models work better than smaller ones. This phenomenon is often referred to as the scaling law by practitioners. It is of fundamental interest to study why the scaling law exists and how it avoids/controls overfitting. One approach has been looking at infinite width limits of neural networks (e.g., Neural Tangent Kernels, Gaussian Processes); however, in practice, these do not fully explain finite networks as their infinite counterparts do not learn features. Furthermore, the empirical kernel for finite networks (i.e., the inner product of feature vectors) changes significantly during training, in contrast to infinite width networks. In this work we derive an iterative linearised training method. We justify iterative linearisation as an interpolation between finite analogs of the infinite width regime, which do not learn features, and standard gradient descent training, which does. We show some preliminary results where iterative linearised training works well, noting in particular how much feature learning is required to achieve comparable performance. We also provide novel insights into the training behaviour of neural networks.

#### Causal Inference Through the Structural Causal Marginal Problem

L. Gresele*, J. von Kügelgen*, J. M. Kübler*, E. Kirschbaum, B. Schölkopf, D. Janzing, 2022. (In 39th International Conference on Machine Learning). Edited by Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu, Sivan Sabato. PMLR. Note: *equal contribution.

We introduce an approach to counterfactual inference based on merging information from multiple datasets. We consider a causal reformulation of the statistical marginal problem: given a collection of marginal structural causal models (SCMs) over distinct but overlapping sets of variables, determine the set of joint SCMs that are counterfactually consistent with the marginal ones. We formalise this approach for categorical SCMs using the response function formulation and show that it reduces the space of allowed marginal and joint SCMs. Our work thus highlights a new mode of falsifiability through additional variables, in contrast to the statistical one via additional data.

#### Modelling content creator incentives on algorithm-curated platforms

Jiri Hron, Karl Krauth, Michael I. Jordan, Niki Kilbertus, Sarah Dean, 2022. (arXiv).

Content creators compete for user attention. Their reach crucially depends on algorithmic choices made by developers on online platforms. To maximize exposure, many creators adapt strategically, as evidenced by examples like the sprawling search engine optimization industry. This begets competition for the finite user attention pool. We formalize these dynamics in what we call an exposure game, a model of incentives induced by algorithms including modern factorization and (deep) two-tower architectures. We prove that seemingly innocuous algorithmic choices (e.g., non-negative vs. unconstrained factorization) significantly affect the existence and character of (Nash) equilibria in exposure games. We proffer use of creator behavior models like ours for an (ex-ante) pre-deployment audit. Such an audit can identify misalignment between desirable and incentivized content, and thus complement post-hoc measures like content filtering and moderation. To this end, we propose tools for numerically finding equilibria in exposure games, and illustrate results of an audit on the MovieLens and LastFM datasets. Among other things, we find that the strategically produced content exhibits strong dependence between algorithmic exploration and content diversity, and between model expressivity and bias towards gender-based user and creator groups.

#### Bayesian neural networks have a simple weight posterior: theory and accelerated sampling

Jiri Hron, Roman Novak, Jeffrey Pennington, Jascha Sohl-Dickstein, 2022. (ICML).

We introduce repriorisation, a data-dependent reparameterisation which transforms a Bayesian neural network (BNN) posterior to a distribution whose KL divergence to the BNN prior vanishes as layer widths grow. The repriorisation map acts directly on parameters, and its analytic simplicity complements the known neural network Gaussian process (NNGP) behaviour of wide BNNs in function space. Exploiting the repriorisation, we develop a Markov chain Monte Carlo (MCMC) posterior sampling algorithm which mixes faster the wider the BNN. This contrasts with the typically poor performance of MCMC in high dimensions. We observe up to 50x higher effective sample size relative to no reparameterisation for both fully-connected and residual networks. Improvements are achieved at all widths, with the margin between reparameterised and standard BNNs growing with layer width.

#### Variational Inference in Dynamical Systems

Alessandro Davide Ialongo, 2022. University of Cambridge, Department of Engineering, Cambridge, UK. **DOI**: https://doi.org/10.17863/CAM.91368.

Dynamical systems are a powerful formalism to analyse the world around us. Many datasets are sequential in nature, and can be described by a discrete time evolution law. We are interested in approaching the analysis of such datasets from a probabilistic perspective. We would like to maintain justified beliefs about quantities which, though useful in explaining the behaviour of a system, may not be observable, as well as about the system’s evolution itself, especially in regimes we have not yet observed in our data. The framework of statistical inference gives us the tools to do so, yet, for many systems of interest, performing inference exactly is not computationally or analytically tractable. The contribution of this thesis, then, is twofold: first, we uncover two sources of bias in existing variational inference methods applied to dynamical systems in general, and state space models whose transition function is drawn from a Gaussian process (GPSSM) in particular. We show bias can derive from assuming posteriors in non-linear systems to be jointly Gaussian, and from assuming that we can sever the dependence between latent states and transition function in state space model posteriors. Second, we propose methods to address these issues, undoing the resulting biases. We do this without compromising on computational efficiency or on the ability to scale to larger datasets and higher dimensions, compared to the methods we rectify. One method, the Markov Autoregressive Flow (Markov AF) addresses the Gaussian assumption, by providing a more flexible class of posteriors, based on normalizing flows, which can be easily evaluated, sampled, and optimised. The other method, Variationally Coupled Dynamics and Trajectories (VCDT), tackles the factorisation assumption, leveraging sparse Gaussian processes and their variational representation to reintroduce dependence between latent states and the transition function at no extra computational cost. 
Since the objective of inference is to maintain calibrated beliefs, if we employed approximations which are significantly biased in non-linear, noisy systems, or when there is little data available, we would have failed in our objective, as those are precisely the regimes in which uncertainty quantification is all the more important. Hence we think it is essential, if we wish to act optimally on such beliefs, to uncover, and, if possible, to correct, all sources of systematic bias in our inference methods.

#### From statistical to causal learning

B. Schölkopf*, J. von Kügelgen*, 2022. (In Proceedings of the International Congress of Mathematicians (ICM)). EMS Press. **Note**: *equal contribution.

We describe basic ideas underlying research to build and understand artificially intelligent systems: from symbolic approaches via statistical learning to interventional models relying on concepts of causality. Some of the hard open problems of machine learning and AI are intrinsically related to causality, and progress may require advances in our understanding of how to model and infer causality from data.

#### On the Fairness of Causal Algorithmic Recourse

J. von Kügelgen, A.-H. Karimi, U. Bhatt, I. Valera, A. Weller, B. Schölkopf, 2022. (In Proceedings of the 36th AAAI Conference on Artificial Intelligence (AAAI)).

Algorithmic fairness is typically studied from the perspective of predictions. Instead, here we investigate fairness from the perspective of recourse actions suggested to individuals to remedy an unfavourable classification. We propose two new fairness criteria at the group and individual level, which – unlike prior work on equalising the average group-wise distance from the decision boundary – explicitly account for causal relationships between features, thereby capturing downstream effects of recourse actions performed in the physical world. We explore how our criteria relate to others, such as counterfactual fairness, and show that fairness of recourse is complementary to fairness of prediction. We study theoretically and empirically how to enforce fair causal recourse by altering the classifier and perform a case study on the Adult dataset. Finally, we discuss whether fairness violations in the data generating process revealed by our criteria may be better addressed by societal interventions as opposed to constraints on the classifier.

#### Sparse Gaussian Process Hyperparameters: Optimize or Integrate?

Vidhi Lalchand, Wessel P. Bruinsma, David R. Burt, Carl E. Rasmussen, 2022. (In nips36).

The kernel function and its hyperparameters are the central model selection choice in a Gaussian process [Rasmussen and Williams, 2006]. Typically, the hyperparameters of the kernel are chosen by maximising the marginal likelihood, an approach known as Type-II maximum likelihood (ML-II). However, ML-II does not account for hyperparameter uncertainty, and it is well-known that this can lead to severely biased estimates and an underestimation of predictive uncertainty. While there are several works which employ a fully Bayesian characterisation of GPs, relatively few propose such approaches for the sparse GP paradigm. In this work we propose an algorithm for sparse Gaussian process regression which leverages MCMC to sample from the hyperparameter posterior within the variational inducing point framework of [Titsias, 2009]. This work is closely related to Hensman et al. [2015b], but side-steps the need to sample the inducing points, thereby significantly improving sampling efficiency in the Gaussian likelihood case. We compare this scheme against natural baselines in the literature and against stochastic variational GPs (SVGPs), and provide an extensive computational analysis.

#### Generalised GPLVM with Stochastic Variational Inference

Vidhi Lalchand, Aditya Ravuri, Neil D. Lawrence, 28–30 Mar 2022. (In 25th International Conference on Artificial Intelligence and Statistics). PMLR. Proceedings of Machine Learning Research.

Gaussian process latent variable models (GPLVM) are a flexible and non-linear approach to dimensionality reduction, extending classical Gaussian processes to an unsupervised learning context. The Bayesian incarnation of the GPLVM uses a variational framework, where the posterior over latent variables is approximated by a well-behaved variational family, a factorised Gaussian yielding a tractable lower bound. However, the non-factorisability of the lower bound prevents truly scalable inference. In this work, we study the doubly stochastic formulation of the Bayesian GPLVM model, which is amenable to minibatch training. We show how this framework is compatible with different latent variable formulations and perform experiments to compare a suite of models. Further, we demonstrate how we can train in the presence of massively missing data and obtain high-fidelity reconstructions. We demonstrate the model’s performance by benchmarking against the canonical sparse GPLVM for high dimensional data examples.

#### Kernel Learning for Explainable Climate Science

Vidhi Lalchand, Kenza Tazi, Talay M Cheema, Richard E Turner, Scott Hosking, 2022. (In 16th Bayesian Modelling Applications Workshop at UAI, 2022).

The Upper Indus Basin (UIB) in the Himalayas provides water for 270 million people and countless ecosystems. However, precipitation, a key component of hydrological modelling, is poorly understood in this area. A key challenge surrounding this uncertainty comes from the complex spatial-temporal distribution of precipitation across the basin. In this work we propose Gaussian processes with structured non-stationary kernels to model precipitation patterns in the UIB. Previous attempts to quantify or model precipitation in the Hindu Kush Karakoram Himalayan region have often been qualitative or include crude assumptions and simplifications which cannot be resolved at lower resolutions. This body of research also provides little to no error propagation. We account for the spatial variation in precipitation with a non-stationary Gibbs kernel parameterised with an input dependent lengthscale. This allows the posterior function samples to adapt to the varying precipitation patterns inherent in the distinct underlying topography of the Indus region. The input dependent lengthscale is governed by a latent Gaussian process with a stationary squared-exponential kernel to allow the function level hyperparameters to vary smoothly. In ablation experiments we motivate each component of the proposed kernel by demonstrating its ability to model the spatial covariance, temporal structure and joint spatio-temporal reconstruction. We benchmark our model against a stationary Gaussian process and a deep Gaussian process.

#### Goal Misgeneralization in Deep Reinforcement Learning

Lauro Langosco di Langosco, Jack Koch, Lee D Sharkey, Jacob Pfau, David Krueger, 2022. (In icml2022).

#### Complex interlinkages, key objectives and nexuses amongst the Sustainable Development Goals and climate change: a network analysis

F. Laumann, J. von Kügelgen, T. H. Kanashiro Uehara, M. Barahona, 2022. (The Lancet Planetary Health). **DOI**: 10.1016/S2542-5196(22)00070-5.

Background: Global sustainability is an enmeshed system of complex socioeconomic, climatological, and ecological interactions. The numerous objectives of the UN’s Sustainable Development Goals (SDGs) and the Paris Agreement have various levels of interdependence, making it difficult to ascertain the influence of changes to particular indicators across the whole system. In this analysis, we aimed to detect and rank the complex interlinkages between objectives of sustainability agendas. Methods: We developed a method to find interlinkages among the 17 SDGs and climate change, including non-linear and non-monotonic dependences. We used time series of indicators defined by the World Bank, consisting of 400 indicators that measure progress towards the 17 SDGs and an 18th variable (annual average temperatures), representing progress in the response to the climate crisis, from 2000 to 2019. This method detects significant dependencies among the time evolution of the objectives by using partial distance correlations, a non-linear measure of conditional dependence that also discounts spurious correlations originating from lurking variables. We then used a network representation to identify the most important objectives (using network centrality) and to obtain nexuses of objectives (defined as highly interconnected clusters in the network). Findings: Using temporal data from 181 countries spanning 20 years, we analysed dependencies among SDGs and climate for 35 country groupings based on region, development, and income level. The observed significant interlinkages, central objectives, and nexuses identified varied greatly across country groupings; however, SDG 17 (partnerships for the goals) and climate change ranked as highly important across many country groupings. Temperature rise was strongly linked to urbanisation, air pollution, and slum expansion (SDG 11), especially in country groupings likely to be worst affected by climate breakdown, such as Africa. 
In several country groupings composed of developing nations, we observed a consistent nexus of strongly interconnected objectives formed by SDG 1 (poverty reduction), SDG 4 (education), and SDG 8 (economic growth), sometimes incorporating SDG 5 (gender equality), and SDG 16 (peace and justice). Interpretation: The differences across groupings emphasise the need to define goals in accordance with local circumstances and priorities. Our analysis highlights global partnerships (SDG 17) as a pivot in global sustainability efforts, which have been strongly linked to economic growth (SDG 8). However, if economic growth and trade expansion were repositioned as a means instead of an end goal of development, our analysis showed that education (SDG 4) and poverty reduction (SDG 1) become more central, thus suggesting that these could be prioritised in global partnerships. Urban livelihoods (SDG 11) were also flagged as important to avoid replicating unsustainable patterns of the past.

#### Diverse and Amortised Counterfactual Explanations for Uncertainty Estimates

Dan Ley, Umang Bhatt, Adrian Weller, 2022. (In Proceedings of the 36th AAAI Conference on Artificial Intelligence (AAAI)).

To interpret uncertainty estimates from differentiable probabilistic models, recent work has proposed generating a single Counterfactual Latent Uncertainty Explanation (CLUE) for a given data point where the model is uncertain. We broaden the exploration to examine δ-CLUE, the set of potential CLUEs within a δ ball of the original input in latent space. We study the diversity of such sets and find that many CLUEs are redundant; as such, we propose DIVerse CLUE (∇-CLUE), a set of CLUEs which each propose a distinct explanation as to how one can decrease the uncertainty associated with an input. We then further propose GLobal AMortised CLUE (GLAM-CLUE), a distinct, novel method which learns amortised mappings that apply to specific groups of uncertain inputs, taking them and efficiently transforming them in a single function call into inputs for which a model will be certain. Our experiments show that δ-CLUE, ∇-CLUE, and GLAM-CLUE all address shortcomings of CLUE and provide beneficial explanations of uncertainty estimates to practitioners.

#### Chefs' Random Tables: Non-Trigonometric Random Features

Valerii Likhosherstov, Krzysztof Choromanski, Avinava Dubey, Frederick Liu, Tamas Sarlos, Adrian Weller, 2022. arXiv. **DOI**: 10.48550/ARXIV.2205.15317.

We introduce chefs’ random tables (CRTs), a new class of non-trigonometric random features (RFs) to approximate Gaussian and softmax kernels. CRTs are an alternative to standard random kitchen sink (RKS) methods, which inherently rely on the trigonometric maps. We present variants of CRTs where RFs are positive, a key requirement for applications in recent low-rank Transformers. Further variance reduction is possible by leveraging statistics which are simple to compute. One instantiation of CRTs, the optimal positive random features (OPRFs), is to our knowledge the first RF method for unbiased softmax kernel estimation with positive and bounded RFs, resulting in exponentially small tails and much lower variance than its counterparts. As we show, orthogonal random features applied in OPRFs provide additional variance reduction for any dimensionality d (not only asymptotically for sufficiently large d, as for RKS). We test CRTs on many tasks ranging from non-parametric classification to training Transformers for text, speech and image data, obtaining new state-of-the-art results for low-rank text Transformers, while providing linear space and time complexity.

#### You Mostly Walk Alone: Analyzing Feature Attribution in Trajectory Prediction

O. Makansi, J. von Kügelgen, F. Locatello, P. Gehler, D. Janzing, T. Brox, B. Schölkopf, 2022. (In 10th International Conference on Learning Representations).

Predicting the future trajectory of a moving agent can be easy when the past trajectory continues smoothly but is challenging when complex interactions with other agents are involved. Recent deep learning approaches for trajectory prediction show promising performance and partially attribute this to successful reasoning about agent-agent interactions. However, it remains unclear which features such black-box models actually learn to use for making predictions. This paper proposes a procedure that quantifies the contributions of different cues to model performance based on a variant of Shapley values. Applying this procedure to state-of-the-art trajectory prediction methods on standard benchmark datasets shows that they are, in fact, unable to reason about interactions. Instead, the past trajectory of the target is the only feature used for predicting its future. For a task with richer social interaction patterns, on the other hand, the tested models do pick up such interactions to a certain extent, as quantified by our feature attribution method. We discuss the limits of the proposed method and its links to causality.

#### Practical Conditional Neural Processes via Tractable Dependent Predictions

Stratis Markou, James Requeima, Wessel P. Bruinsma, Anna Vaughan, Richard E. Turner, 2022. (In 10th International Conference on Learning Representations).

Conditional Neural Processes (CNPs; Garnelo et al., 2018) are meta-learning models which leverage the flexibility of deep learning to produce well-calibrated predictions and naturally handle off-the-grid and missing data. CNPs scale to large datasets and train with ease. Due to these features, CNPs appear well-suited to tasks from environmental sciences or healthcare. Unfortunately, CNPs do not produce correlated predictions, making them fundamentally inappropriate for many estimation and decision making tasks. Predicting heat waves or floods, for example, requires modelling dependencies in temperature or precipitation over time and space. Existing approaches which model output dependencies, such as Neural Processes (NPs; Garnelo et al., 2018b) or the FullConvGNP (Bruinsma et al., 2021), are either complicated to train or prohibitively expensive. What is needed is an approach which provides dependent predictions, but is simple to train and computationally tractable. In this work, we present a new class of Neural Process models that make correlated predictions and support exact maximum likelihood training that is simple and scalable. We extend the proposed models by using invertible output transformations, to capture non-Gaussian output distributions. Our models can be used in downstream estimation tasks which require dependent function samples. By accounting for output dependencies, our models show improved predictive performance on a range of experiments with synthetic and real data.

#### Information-theoretic Inducing Point Placement for High-Throughput Bayesian Optimisation

Henry B. Moss, Sebastian W. Ober, Victor Picheny, 2022. (In ICML Workshop on Adaptive Experimental Design and Active Learning in the Real World (RealML)).

Sparse Gaussian Processes are a key component of high-throughput Bayesian optimisation (BO) loops, an increasingly common setting where evaluation budgets are large and highly parallelised. By using representative subsets of the available data to build approximate posteriors, sparse models dramatically reduce the computational costs of surrogate modelling by relying on a small set of pseudo-observations, the so-called inducing points, in lieu of the full data set. However, current approaches to designing inducing points are not appropriate within BO loops as they seek to reduce global uncertainty in the objective function. Thus, the high-fidelity modelling of promising and data-dense regions required for precise optimisation is sacrificed and computational resources are instead wasted on modelling areas of the space already known to be sub-optimal. Inspired by entropy-based BO methods, we propose a novel inducing point design that uses a principled information-theoretic criterion to select inducing points. By choosing inducing points to maximally reduce both global uncertainty and uncertainty in the maximum value of the objective function, we build surrogate models able to support high-precision high-throughput BO.

#### Adversarial Attacks are a Surprisingly Strong Baseline for Poisoning Few-Shot Meta-Learners

Elre T. Oldewage, John Bronskill, Richard E. Turner, 2022. (In I Can't Believe It's Not Better, Workshop at NeurIPS 2022).

This paper examines the robustness of deployed few-shot meta-learning systems when they are fed an imperceptibly perturbed few-shot dataset. We attack amortized meta-learners, which allows us to craft colluding sets of inputs that are tailored to fool the system’s learning algorithm when used as training data. Jointly crafted adversarial inputs might be expected to synergistically manipulate a classifier, allowing for very strong data-poisoning attacks that would be hard to detect. We show that in a white box setting, these attacks are very successful and can cause the target model’s predictions to become worse than chance. However, in opposition to the well-known transferability of adversarial examples in general, the colluding sets do not transfer well to different classifiers. We explore two hypotheses to explain this: ‘overfitting’ by the attack, and mismatch between the model on which the attack is generated and that to which the attack is transferred. Regardless of the mitigation strategies suggested by these hypotheses, the colluding inputs transfer no better than adversarial inputs that are generated independently in the usual way.

#### The UK Algorithmic Transparency Standard: A Qualitative Analysis of Police Perspectives

Marion Oswald, Luke Chambers, Ellen P Goodman, Pam Ugwudike, Miri Zilka, 2022. (Available at SSRN).

1. The UK Government’s draft ‘Algorithmic Transparency Standard’ is intended to provide a standardised way for public bodies and government departments to provide information about how algorithmic tools are being used to support decisions. The research discussed in this report was conducted in parallel to the piloting of the Standard by the Cabinet Office and the Centre for Data Ethics and Innovation.
2. We conducted semi-structured interviews with respondents from across UK policing and commercial bodies involved in policing technologies. Our aim was to explore the implications for police forces of participation in the Standard, to identify rewards, risks, challenges for the police, and areas where the Standard could be improved, and therefore to contribute to the exploration of policy options for expansion of participation in the Standard.
3. Algorithmic transparency is both achievable for policing and could bring significant rewards. A key reward of police participation in the Standard is that it provides the opportunity to demonstrate proficient implementation of technology-driven policing, thus enhancing earned trust. Research participants highlighted the public good that could result from the considered use of algorithms.
4. Participants noted, however, a risk of misperception of the dangers of policing technology, especially if use of algorithmic tools was not appropriately compared to the status quo and current methods.
5. Participation in the Standard provides an opportunity to develop increased sharing among police forces of best practices (and things to avoid), and increased thoughtfulness among police force personnel in building and implementing new tools. Research participants were keen for compliance with the Standard to become an integral part of a holistic system to drive reflective practice across policing around the development and deployment of algorithmic technology. This could enable police to learn from each other, facilitate good policy choices and decrease wasted costs. Otherwise, the Standard may come to be regarded as an administrative burden rather than a benefit for policing.
6. Several key areas for amendment and improvement from the perspective of policing were identified in the research. These could improve the Standard for the benefit of all participants. These include a need for clarification of the scope of the Standard, and the stage of project development at which the Standard should apply. It is recommended that consideration be given to a ‘Standard-Lite’ for projects at the pilot or early stages of the development process in order to gain public understanding of new tools and applications. Furthermore, the Standard would benefit from a more substantial glossary (to include relevant policing terms) and additional guidance on the level of detail required in each section and how accuracy rates should be described, justified and explained in order to ensure consistency.
7. The research does not suggest any overriding reason why the Standard should not be applied in policing. Suitable exemptions for sensitive contexts and tradecraft would be required, however, and consideration given to ensuring that forces have the resources to comply with the Standard and to respond to the increased public interest that could ensue. Limiting the scope initially to tools on a defined list (to include the most high-risk tools, such as those that produce individualised risk/predictive scores) could assist in mitigating concerns over sensitive policing capabilities and resourcing. A non-public version of the Standard for sensitive applications and tools could also be considered, which would be available to bodies with an independent oversight function.
8. To support police compliance with the Standard, supplier responsibilities – including appropriate disclosure of algorithmic functionality, data inputs and performance – should be covered in procurement contracts and addressed up front as a mandatory requirement of doing business with the police.
9. As well as contributing to the piloting of the Standard, it is recommended that the findings of this report are considered at NPCC level, by the College of Policing and by the office of the Chief Scientific Advisor for Policing, as new sector-led guidance, best practice and policy are developed.

#### Causal Discovery in Heterogeneous Environments Under the Sparse Mechanism Shift Hypothesis

R. Perry, J. von Kügelgen*, B. Schölkopf*, 2022. (In Advances in Neural Information Processing Systems 35). Curran Associates, Inc. **Note**: *shared last author.

Machine learning approaches commonly rely on the assumption of independent and identically distributed (i.i.d.) data. In reality, however, this assumption is almost always violated due to distribution shifts between environments. Although valuable learning signals can be provided by heterogeneous data from changing distributions, it is also known that learning under arbitrary (adversarial) changes is impossible. Causality provides a useful framework for modeling distribution shifts, since causal models encode both observational and interventional distributions. In this work, we explore the sparse mechanism shift hypothesis, which posits that distribution shifts occur due to a small number of changing causal conditionals. Motivated by this idea, we apply it to learning causal structure from heterogeneous environments, where i.i.d. data only allows for learning an equivalence class of graphs without restrictive assumptions. We propose the Mechanism Shift Score (MSS), a score-based approach amenable to various empirical estimators, which provably identifies the entire causal structure with high probability if the sparse mechanism shift hypothesis holds. Empirically, we verify behavior predicted by the theory and compare multiple estimators and score functions to identify the best approaches in practice. Compared to other methods, we show how MSS bridges a gap by both being nonparametric as well as explicitly leveraging sparse changes.

#### Spectral Diffusion Processes

Angus Phillips, Thomas Seror, Michael Hutchinson, Valentin De Bortoli, Arnaud Doucet, Emile Mathieu, 2022. (In NeurIPS workshop on Score-Based Methods).

Score-based generative modelling (SGM) has proven to be a very effective method for modelling densities on finite-dimensional spaces. In this work we propose to extend this methodology to learn generative models over functional spaces. To do so, we represent functional data in spectral space to dissociate the stochastic part of the processes from their space-time part. Using dimensionality reduction techniques we then sample from their stochastic component using finite dimensional SGM. We demonstrate our method’s effectiveness for modelling various multimodal datasets.

#### Challenges and Pitfalls of Bayesian Unlearning

Ambrish Rawat, James Requeima, Wessel Bruinsma, Richard Turner, 2022. (In ICML 2022 Workshop on Updatable Machine Learning (UpML)).

Machine unlearning refers to the task of removing a subset of training data, thereby removing its contributions to a trained model. Approximate unlearning methods are one class of methods for this task which avoid the need to retrain the model from scratch on the retained data. Bayes’ rule can be used to cast approximate unlearning as an inference problem where the objective is to obtain the updated posterior by dividing out the likelihood of deleted data. However, this has its own set of challenges, as one often doesn’t have access to the exact posterior of the model parameters. In this work we examine the use of the Laplace approximation and Variational Inference to obtain the updated posterior. With a neural network trained for a regression task as the guiding example, we draw insights on the applicability of Bayesian unlearning in practical scenarios.

#### Embrace the Gap: VAEs Perform Independent Mechanism Analysis

P. Reizinger*, L. Gresele*, J. Brady*, J. von Kügelgen, D. Zietlow, B. Schölkopf, G. Martius, W. Brendel, M. Besserve, 2022. (In Advances in Neural Information Processing Systems 35). Curran Associates, Inc. Note: *equal first authorship.

Variational autoencoders (VAEs) are a popular framework for modeling complex data distributions; they can be efficiently trained via variational inference by maximizing the evidence lower bound (ELBO), at the expense of a gap to the exact (log-)marginal likelihood. While VAEs are commonly used for representation learning, it is unclear why ELBO maximization would yield useful representations, since unregularized maximum likelihood estimation cannot invert the data-generating process. Yet, VAEs often succeed at this task. We seek to elucidate this apparent paradox by studying nonlinear VAEs in the limit of near-deterministic decoders. We first prove that, in this regime, the optimal encoder approximately inverts the decoder – a commonly used but unproven conjecture – which we refer to as self-consistency. Leveraging self-consistency, we show that the ELBO converges to a regularized log-likelihood. This allows VAEs to perform what has recently been termed independent mechanism analysis (IMA): it adds an inductive bias towards decoders with column-orthogonal Jacobians, which helps recover the true latent factors. The gap between ELBO and log-likelihood is therefore welcome, since it bears unanticipated benefits for nonlinear representation learning. In experiments on synthetic and image data, we show that VAEs uncover the true latent factors when the data generating process satisfies the IMA assumption.

#### Last layer marginal likelihood for invariance learning

Pola E. Schwöbel, Martin Jørgensen, Sebastian W. Ober, Mark van der Wilk, 2022. (In 25th International Conference on Artificial Intelligence and Statistics).

Data augmentation is often used to incorporate inductive biases into models. Traditionally, these are hand-crafted and tuned with cross validation. The Bayesian paradigm for model selection provides a path towards end-to-end learning of invariances using only the training data, by optimising the marginal likelihood. Computing the marginal likelihood is hard for neural networks, but success with tractable approaches that compute the marginal likelihood for the last layer only raises the question of whether this convenient approach might be employed for learning invariances. We show partial success on standard benchmarks, in the low-data regime and on a medical imaging dataset by designing a custom optimisation routine. Introducing a new lower bound to the marginal likelihood allows us to perform inference for a larger class of likelihood functions than before. On the other hand, we demonstrate failure modes on the CIFAR10 dataset, where the last layer approximation is not sufficient due to the increased complexity of our neural network. Our results indicate that once more sophisticated approximations become available the marginal likelihood is a promising approach for invariance learning in neural networks.

#### Visual Representation Learning Does Not Generalize Strongly Within the Same Domain

L. Schott, J. von Kügelgen, F. Träuble, P. Gehler, C. Russell, M. Bethge, B. Schölkopf, F. Locatello, W. Brendel, 2022. (In 10th International Conference on Learning Representations).

An important component for generalization in machine learning is to uncover underlying latent factors of variation as well as the mechanism through which each factor acts in the world. In this paper, we test whether 17 unsupervised, weakly supervised, and fully supervised representation learning approaches correctly infer the generative factors of variation in simple datasets (dSprites, Shapes3D, MPI3D) from controlled environments, and on our contributed CelebGlow dataset. In contrast to prior robustness work that introduces novel factors of variation during test time, such as blur or other (un)structured noise, we here recompose, interpolate, or extrapolate only existing factors of variation from the training data set (e.g., small and medium-sized objects during training and large objects during testing). Models that learn the correct mechanism should be able to generalize to this benchmark. In total, we train and test 2000+ models and observe that all of them struggle to learn the underlying mechanism regardless of supervision signal and architectural bias. Moreover, the generalization capabilities of all tested models drop significantly as we move from artificial datasets towards more realistic real-world datasets. Despite their inability to identify the correct mechanism, the models are quite modular as their ability to infer other in-distribution factors remains fairly stable, provided only a single factor is out-of-distribution. These results point to an important yet understudied problem of learning mechanistic models of observations that can facilitate generalization.

#### Metadata Archaeology: Unearthing Data Subsets by Leveraging Training Dynamics

Shoaib Ahmed Siddiqui, Nitarshan Rajkumar, Tegan Maharaj, David Krueger, Sara Hooker, 2022. (arXiv preprint arXiv:2209.10015).

Modern machine learning research relies on relatively few carefully curated datasets. Even in these datasets, and typically in ‘untidy’ or raw data, practitioners are faced with significant issues of data quality and diversity which can be prohibitively labor intensive to address. Existing methods for dealing with these challenges tend to make strong assumptions about the particular issues at play, and often require a priori knowledge or metadata such as domain labels. Our work is orthogonal to these methods: we instead focus on providing a unified and efficient framework for Metadata Archaeology – uncovering and inferring metadata of examples in a dataset. We curate different subsets of data that might exist in a dataset (e.g. mislabeled, atypical, or out-of-distribution examples) using simple transformations, and leverage differences in learning dynamics between these probe suites to infer metadata of interest. Our method is on par with far more sophisticated mitigation methods across different tasks: identifying and correcting mislabeled examples, classifying minority-group samples, prioritizing points relevant for training and enabling scalable human auditing of relevant examples.

**Comment:** Project webpage: https://metadata-archaeology.github.io/

#### Defining and Characterizing Reward Hacking

Joar Skalse, Nikolaus HR Howe, Dmitrii Krasheninnikov, David Krueger, 2022. (In Advances in Neural Information Processing Systems 35).

#### Advances in Software and Spatio-Temporal Modelling with Gaussian Processes

Will Tebbutt, 2022. University of Cambridge, Department of Engineering.

This thesis concerns the use of Gaussian processes (GPs) as distributions over unknown functions in Machine Learning and probabilistic modeling. GPs have been found to have utility in a wide range of applications owing to their flexibility, interpretability, and tractability. I advance their use in three directions. Firstly, the abstractions upon which software is built for their use in practice. In modern GP software libraries such as GPML, GPy, GPflow, and GPyTorch, the kernel is undoubtedly the dominant abstraction. While it remains highly successful it of course has limitations, and I propose to address some of these through a complementary abstraction: affine transformations of GPs. Specifically I show how a collection of GPs, and affine transformations thereof, can themselves be treated as a single GP. This in turn leads to a design for software, including exact and approximate inference algorithms. I demonstrate the utility of this software through a collection of worked examples, focussing on models which are more cleanly and easily expressed using this new software. Secondly, I develop a new scalable approximate inference algorithm for a class of GPs commonly utilised in spatio-temporal problems. This is a setting in which GPs excel, for example enabling the incorporation of important inductive biases, and observations made at arbitrary points in time and space. However, the computation required to perform exact inference and learning in GPs scales cubically in the number of observations, necessitating approximation, to which end I combine two important complementary classes of approximation: pseudo-point and Markovian. The key contribution is the insight that a simple and useful way to combine them turns out to be well-justified. This resolves an open question in the literature, provides new insight into existing work, and a new family of approximations. The efficacy of an important member of this family is demonstrated empirically. 
Finally I develop a GP model and associated approximate inference techniques for the prediction of sea surface temperatures (SSTs) on decadal time scales, which are relevant when taking planning decisions which consider resilience to climate change. There remains a large degree of uncertainty as to the state of the climate on such time scales, but it is thought to be possible to reduce this by exploiting the predictability of natural variability in the climate. The developed GP-based model incorporates a key assumption used by the existing statistical models employed for decadal prediction, thus retaining a valuable inductive bias, while offering several advantages. Amongst these is the lack of need for spatial aggregation of data, which is especially relevant when data are sparse, as is the case with historical ocean SST data. In summary, this thesis contributes to the practical use of GPs through a set of abstractions that are useful in the design of software, algorithms for approximate inference in spatial-temporal settings, and their use in decadal climate prediction.

#### Numerically Stable Sparse Gaussian Processes via Minimum Separation using Cover Trees

Alexander Terenin, David R. Burt, Artem Artemev, Seth Flaxman, Mark van der Wilk, Carl Edward Rasmussen, Hong Ge, 2022. (arXiv).

As Gaussian processes mature, they are increasingly being deployed as part of larger machine learning and decision-making systems, for instance in geospatial modeling, Bayesian optimization, or in latent Gaussian models. Within a system, the Gaussian process model needs to perform in a stable and reliable manner to ensure it interacts correctly with other parts of the system. In this work, we study the numerical stability of scalable sparse approximations based on inducing points. We derive sufficient and in certain cases necessary conditions on the inducing points for the computations performed to be numerically stable. For low-dimensional tasks such as geospatial modeling, we propose an automated method for computing inducing points satisfying these conditions. This is done via a modification of the cover tree data structure, which is of independent interest. We additionally propose an alternative sparse approximation for regression with a Gaussian likelihood which trades off a small amount of performance to further improve stability. We evaluate the proposed techniques on a number of examples, showing that, in geospatial settings, sparse approximations with guaranteed numerical stability often perform comparably to those without.

#### Active Bayesian Causal Inference

C. Toth, L. Lorch, C. Knoll, A. Krause, F. Pernkopf, R. Peharz*, J. von Kügelgen*, 2022. (In Advances in Neural Information Processing Systems 35). Curran Associates, Inc. **Note**: *shared last author.

Causal discovery and causal reasoning are classically treated as separate and consecutive tasks: one first infers the causal graph, and then uses it to estimate causal effects of interventions. However, such a two-stage approach is uneconomical, especially in terms of actively collected interventional data, since the causal query of interest may not require a fully-specified causal model. From a Bayesian perspective, it is also unnatural, since a causal query (e.g., the causal graph or some causal effect) can be viewed as a latent quantity subject to posterior inference – other unobserved quantities that are not of direct interest (e.g., the full causal model) ought to be marginalized out in this process and contribute to our epistemic uncertainty. In this work, we propose Active Bayesian Causal Inference (ABCI), a fully-Bayesian active learning framework for integrated causal discovery and reasoning, which jointly infers a posterior over causal models and queries of interest. In our approach to ABCI, we focus on the class of causally-sufficient, nonlinear additive noise models, which we model using Gaussian processes. We sequentially design experiments that are maximally informative about our target causal query, collect the corresponding interventional data, and update our beliefs to choose the next experiment. Through simulations, we demonstrate that our approach is more data-efficient than several baselines that only focus on learning the full causal graph. This allows us to accurately learn downstream causal queries from fewer samples while providing well-calibrated uncertainty estimates for the quantities of interest.

#### An Evaluation Framework for the Objective Functions of De Novo Drug Design Benchmarks

Austin Tripp, Wenlin Chen, José Miguel Hernández-Lobato, 2022. (In ICLR 2022 Workshop on Machine Learning for Drug Discovery).

De novo drug design has recently received increasing attention from the machine learning community. It is important that the field is aware of the actual goals and challenges of drug design and the roles that de novo molecule design algorithms could play in accelerating the process, so that algorithms can be evaluated in a way that reflects how they would be applied in real drug design scenarios. In this paper, we propose a framework for critically assessing the merits of benchmarks, and argue that most of the existing de novo drug design benchmark functions are either highly unrealistic or depend upon a surrogate model whose performance is not well characterized. In order for the field to achieve its long-term goals, we recommend that poor benchmarks (especially logP and QED) be deprecated in favour of better benchmarks. We hope that our proposed framework can play a part in developing new de novo drug design benchmarks that are more realistic and ideally incorporate the intrinsic goals of drug design.

#### A Survey and Datasheet Repository of Publicly Available US Criminal Justice Datasets

Miri Zilka, Bradley Butcher, Adrian Weller, 2022. (Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track).

Criminal justice is an increasingly important application domain for machine learning and algorithmic fairness, as predictive tools are becoming widely used in police, courts, and prison systems worldwide. A few relevant benchmarks have received significant attention, e.g., the COMPAS dataset, often without proper consideration of the domain context. To raise awareness of publicly available criminal justice datasets and encourage their responsible use, we conduct a survey, consider contexts, highlight potential uses, and identify gaps and limitations. We provide datasheets for 15 datasets and upload them to a public repository. We compare the datasets across several dimensions, including size, coverage of the population, and potential use, highlighting concerns. We hope that this work can provide a useful starting point for researchers looking for appropriate datasets related to criminal justice, and that the repository will continue to grow as a community effort.

#### Transparency, Governance and Regulation of Algorithmic Tools Deployed in the Criminal Justice System: A UK Case Study

Miri Zilka, Holli Sargeant, Adrian Weller, 2022. (Proceedings of the 2022 AAAI/ACM Conference on AI, Ethics, and Society).

We present a survey of tools used in the criminal justice system in the UK in three categories: data infrastructure, data analysis, and risk prediction. Many tools are currently in deployment, offering potential benefits, including improved efficiency and consistency. However, there are also important concerns. Transparent information about these tools, their purpose, how they are used, and by whom is difficult to obtain. Even when information is available, it is often insufficient to enable a satisfactory evaluation. More work is needed to establish governance mechanisms to ensure that tools are deployed in a transparent, safe and ethical way. We call for more engagement with stakeholders and greater documentation of the intended goal of a tool, how it will achieve this goal compared to other options, and how it will be monitored in deployment. We highlight additional points to consider when evaluating the trustworthiness of deployed tools and make concrete proposals for policy.

#### Provable lifelong learning of representations

Xinyuan Cao, Weiyang Liu, Santosh Vempala, 2022. (In International Conference on Artificial Intelligence and Statistics).

#### Scalable Infomin Learning

Yanzhi Chen, Weihao Sun, Yingzhen Li, Adrian Weller, 2022. (In Advances in Neural Information Processing Systems).

The task of infomin learning aims to learn a representation with high utility while being uninformative about a specified target, with the latter achieved by minimising the mutual information between the representation and the target. It has broad applications, ranging from training fair prediction models against protected attributes, to unsupervised learning with disentangled representations. Recent works on infomin learning mainly use adversarial training, which involves training a neural network to estimate mutual information or its proxy and thus is slow and difficult to optimise. Drawing on recent advances in slicing techniques, we propose a new infomin learning approach, which uses a novel proxy metric to mutual information. We further derive an accurate and analytically computable approximation to this proxy metric, thereby removing the need of constructing neural network-based mutual information estimators. Compared to baselines, experiments on algorithmic fairness, disentangled representation learning and domain adaptation verify that our method can more effectively remove unwanted information with limited time budget.

#### Identifying causes of Pyrocumulonimbus (PyroCb)

Emiliano Diaz, Kenza Tazi, Ashwin S Braude, Daniel Okoh, Kara Lamb, Duncan Watson-Parris, Paula Harder, Nis Meinert, 2022. (In NeurIPS Workshop on Causality for Real-world Impact).

A first causal discovery analysis from observational data of pyroCb (storm clouds generated from extreme wildfires) is presented. Invariant Causal Prediction was used to develop tools to understand the causal drivers of pyroCb formation. This includes a conditional independence test for testing Y conditionally independent of E given X for binary variable Y and multivariate, continuous variables X and E, and a greedy-ICP search algorithm that relies on fewer conditional independence tests to obtain a smaller more manageable set of causal predictors. With these tools, we identified a subset of seven causal predictors which are plausible when contrasted with domain knowledge: surface sensible heat flux, relative humidity at 850 hPa, a component of wind at 250 hPa, 13.3 micro-meters, thermal emissions, convective available potential energy, and altitude.

#### Pre-training Molecular Graph Representation with 3D Geometry

Shengchao Liu, Hanchen Wang, Weiyang Liu, Joan Lasenby, Hongyu Guo, Jian Tang, 2022. (In International Conference on Learning Representations).

#### SphereFace Revived: Unifying Hyperspherical Face Recognition

Weiyang Liu, Yandong Wen, Bhiksha Raj, Rita Singh, Adrian Weller, 2022. (IEEE Transactions on Pattern Analysis and Machine Intelligence). IEEE.

#### Structural Causal 3D Reconstruction

Weiyang Liu, Zhen Liu, Liam Paull, Adrian Weller, Bernhard Schölkopf, 2022. (In European Conference on Computer Vision).

#### Interoperability of statistical models in pandemic preparedness: principles and reality

George Nicholson, Marta Blangiardo, Mark Briers, Peter J Diggle, Tor Erlend Fjelde, Hong Ge, Robert J B Goudie, Radka Jersakova, Ruairidh E King, Brieuc C L Lehmann, Ann-Marie Mallon, Tullia Padellini, Yee Whye Teh, Chris Holmes, Sylvia Richardson, May 2022. (Stat. Sci.).

We present interoperability as a guiding framework for statistical modelling to assist policy makers asking multiple questions using diverse datasets in the face of an evolving pandemic response. Interoperability provides an important set of principles for future pandemic preparedness, through the joint design and deployment of adaptable systems of statistical models for disease surveillance using probabilistic reasoning. We illustrate this through case studies for inferring and characterising spatial-temporal prevalence and reproduction numbers of SARS-CoV-2 infections in England.

#### Pyrocast: a machine learning pipeline to forecast pyrocumulonimbus (pyrocb) clouds

Kenza Tazi, Emiliano Díaz Salas-Porras, Ashwin Braude, Daniel Okoh, Kara D Lamb, Duncan Watson-Parris, Paula Harder, Nis Meinert, 2022. (NeurIPS Workshop on Tackling Climate Change with Machine Learning).

Pyrocumulonimbus (pyroCb) clouds are storm clouds generated by extreme wildfires. PyroCbs are associated with unpredictable, and therefore dangerous, wildfire spread. They can also inject smoke particles and trace gases into the upper troposphere and lower stratosphere, affecting the Earth’s climate. As global temperatures increase, these previously rare events are becoming more common. Being able to predict which fires are likely to generate pyroCb is therefore key to climate adaptation in wildfire-prone areas. This paper introduces Pyrocast, a pipeline for pyroCb analysis and forecasting. The pipeline’s first two components, a pyroCb database and a pyroCb forecast model, are presented. The database brings together geostationary imagery and environmental data for over 148 pyroCb events across North America, Australia, and Russia between 2018 and 2022. Random Forests, Convolutional Neural Networks (CNNs), and CNNs pretrained with Auto-Encoders were tested to predict the generation of pyroCb for a given fire six hours in advance. The best model predicted pyroCb with an AUC of 0.90±0.04.

#### SphereFace2: Binary Classification is All You Need for Deep Face Recognition

Yandong Wen, Weiyang Liu, Adrian Weller, Bhiksha Raj, Rita Singh, 2022. (In International Conference on Learning Representations).

#### Towards principled disentanglement for domain generalization

Hanlin Zhang, Yi-Fan Zhang, Weiyang Liu, Adrian Weller, Bernhard Schölkopf, Eric P Xing, 2022. (In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition).

## 2021

#### Deep kernel processes

Laurence Aitchison, Adam X. Yang, Sebastian W. Ober, 2021. (In 38th International Conference on Machine Learning).

We define deep kernel processes in which positive definite Gram matrices are progressively transformed by nonlinear kernel functions and by sampling from (inverse) Wishart distributions. Remarkably, we find that deep Gaussian processes (DGPs), Bayesian neural networks (BNNs), infinite BNNs, and infinite BNNs with bottlenecks can all be written as deep kernel processes. For DGPs the equivalence arises because the Gram matrix formed by the inner product of features is Wishart distributed, and as we show, standard isotropic kernels can be written entirely in terms of this Gram matrix — we do not need knowledge of the underlying features. We define a tractable deep kernel process, the deep inverse Wishart process, and give a doubly-stochastic inducing-point variational inference scheme that operates on the Gram matrices, not on the features, as in DGPs. We show that the deep inverse Wishart process gives superior performance to DGPs and infinite BNNs on fully-connected baselines.

#### Getting a CLUE: A Method for Explaining Uncertainty Estimates

Javier Antorán, Umang Bhatt, Tameem Adel, Adrian Weller, José Miguel Hernández-Lobato, April 2021. (In 9th International Conference on Learning Representations).

Both uncertainty estimation and interpretability are important factors for trustworthy machine learning systems. However, there is little work at the intersection of these two areas. We address this gap by proposing a novel method for interpreting uncertainty estimates from differentiable probabilistic models, like Bayesian Neural Networks (BNNs). Our method, Counterfactual Latent Uncertainty Explanations (CLUE), indicates how to change an input, while keeping it on the data manifold, such that a BNN becomes more confident about the input’s prediction. We validate CLUE through 1) a novel framework for evaluating counterfactual explanations of uncertainty, 2) a series of ablation experiments, and 3) a user study. Our experiments show that CLUE outperforms baselines and enables practitioners to better understand which input patterns are responsible for predictive uncertainty.

#### Tighter Bounds on the Log Marginal Likelihood of Gaussian Process Regression Using Conjugate Gradients

Artem Artemev, David R. Burt, Mark van der Wilk, 2021. (In 38th International Conference on Machine Learning).

We propose a lower bound on the log marginal likelihood of Gaussian process regression models that can be computed without matrix factorisation of the full kernel matrix. We show that approximate maximum likelihood learning of model parameters by maximising our lower bound retains many benefits of the sparse variational approach while reducing the bias introduced into hyperparameter learning. The basis of our bound is a more careful analysis of the log-determinant term appearing in the log marginal likelihood, as well as using the method of conjugate gradients to derive tight lower bounds on the term involving a quadratic form. Our approach is a step forward in unifying methods relying on lower bound maximisation (e.g. variational methods) and iterative approaches based on conjugate gradients for training Gaussian processes. In experiments, we show improved predictive performance with our model for a comparable amount of training time compared to other conjugate gradient based approaches.

#### Uncertainty as a form of transparency: Measuring, communicating, and using uncertainty

Umang Bhatt, Javier Antorán, Yunfeng Zhang, Q Vera Liao, Prasanna Sattigeri, Riccardo Fogliato, Gabrielle Melançon, Ranganath Krishnan, Jason Stanley, Omesh Tickoo, et al., 2021. (In 4th AAAI/ACM Conference on Artificial Intelligence, Ethics and Society).

Algorithmic transparency entails exposing system properties to various stakeholders for purposes that include understanding, improving, and contesting predictions. Until now, most research into algorithmic transparency has predominantly focused on explainability. Explainability attempts to provide reasons for a machine learning model’s behavior to stakeholders. However, understanding a model’s specific behavior alone might not be enough for stakeholders to gauge whether the model is wrong or lacks sufficient knowledge to solve the task at hand. In this paper, we argue for considering a complementary form of transparency by estimating and communicating the uncertainty associated with model predictions. First, we discuss methods for assessing uncertainty. Then, we characterize how uncertainty can be used to mitigate model unfairness, augment decision-making, and build trustworthy systems. Finally, we outline methods for displaying uncertainty to stakeholders and recommend how to collect information required for incorporating uncertainty into existing ML pipelines. This work constitutes an interdisciplinary review drawn from literature spanning machine learning, visualization/HCI, design, decision-making, and fairness. We aim to encourage researchers and practitioners to measure, communicate, and use uncertainty as a form of transparency.

#### Memory efficient meta-learning with large images

John Bronskill, Daniela Massiceti, Massimiliano Patacchiola, Katja Hofmann, Sebastian Nowozin, Richard E. Turner, 2021. (In Advances in Neural Information Processing Systems 34).

Meta learning approaches to few-shot classification are computationally efficient at test time, requiring just a few optimization steps or a single forward pass to learn a new task, but they remain highly memory-intensive to train. This limitation arises because a task’s entire support set, which can contain up to 1000 images, must be processed before an optimization step can be taken. Harnessing the performance gains offered by large images thus requires either parallelizing the meta-learner across multiple GPUs, which may not be available, or trade-offs between task and image size when memory constraints apply. We improve on both options by proposing LITE, a general and memory efficient episodic training scheme that enables meta-training on large tasks composed of large images on a single GPU. We achieve this by observing that the gradients for a task can be decomposed into a sum of gradients over the task’s training images. This enables us to perform a forward pass on a task’s entire training set but realize significant memory savings by back-propagating only a random subset of these images which we show is an unbiased approximation of the full gradient. We use LITE to train meta-learners and demonstrate new state-of-the-art accuracy on the real-world ORBIT benchmark and 3 of the 4 parts of the challenging VTAB+MD benchmark relative to leading meta-learners. LITE also enables meta-learners to be competitive with transfer learning approaches but at a fraction of the test-time computational cost, thus serving as a counterpoint to the recent narrative that transfer learning is all you need for few-shot classification.
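The unbiasedness claim above can be illustrated with a toy sketch. This is not the LITE implementation: it assumes the full gradient is a plain sum of per-example gradients (here random vectors), and the function name `subset_gradient_estimate` is hypothetical. Averaging the rescaled subset estimate over every subset of a fixed size recovers the full gradient exactly.

```python
import itertools

import numpy as np


def subset_gradient_estimate(per_example_grads, subset, n_total):
    """Rescaled sum of the gradients of the chosen examples (hypothetical helper)."""
    subset = list(subset)
    # Scale by n_total / len(subset) so the estimate matches the full sum in expectation.
    return (n_total / len(subset)) * per_example_grads[subset].sum(axis=0)


rng = np.random.default_rng(0)
n, h, dim = 6, 2, 3  # toy sizes: 6 examples, subsets of 2, 3 parameters

# Stand-in for per-image gradients; in a real model these come from back-propagation.
per_example_grads = rng.normal(size=(n, dim))
full_gradient = per_example_grads.sum(axis=0)

# Enumerate every size-h subset, i.e. take the exact expectation under uniform sampling.
estimates = [
    subset_gradient_estimate(per_example_grads, subset, n)
    for subset in itertools.combinations(range(n), h)
]
mean_estimate = np.mean(estimates, axis=0)

print(np.allclose(mean_estimate, full_gradient))
```

Each example appears in a fraction `h / n` of the subsets, so the `n / h` rescaling cancels it. In training one would draw a single random subset per step rather than enumerate them all.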

#### The Gaussian Neural Process

Wessel P. Bruinsma, James Requeima, Andrew Y. K. Foong, Jonathan Gordon, Richard E. Turner, 2021. (In 3rd Symposium on Advances in Approximate Bayesian Inference).

Neural Processes (NPs; Garnelo et al., 2018a,b) are a rich class of models for meta-learning that map data sets directly to predictive stochastic processes. We provide a rigorous analysis of the standard maximum-likelihood objective used to train conditional NPs. Moreover, we propose a new member to the Neural Process family called the Gaussian Neural Process (GNP), which models predictive correlations, incorporates translation equivariance, provides universal approximation guarantees, and demonstrates encouraging performance.

#### Understanding variational inference in function-space

David R. Burt, Sebastian W. Ober, Adrià Garriga-Alonso, Mark van der Wilk, 2021. (In 3rd Symposium on Advances in Approximate Bayesian Inference).

Recent work has attempted to directly approximate the ‘function-space’ or predictive posterior distribution of Bayesian models, without approximating the posterior distribution over the parameters. This is appealing in e.g. Bayesian neural networks, where we only need the former, and the latter is hard to represent. In this work, we highlight some advantages and limitations of employing the Kullback-Leibler divergence in this setting. For example, we show that minimizing the KL divergence between a wide class of parametric distributions and the posterior induced by a (non-degenerate) Gaussian process prior leads to an ill-defined objective function. Then, we propose (featurized) Bayesian linear regression as a benchmark for ‘function-space’ inference methods that directly measures approximation quality. We apply this methodology to assess aspects of the objective function and inference scheme considered in Sun et al. (2018), emphasizing the quality of approximation to Bayesian inference as opposed to predictive performance.

#### Understanding Local Linearisation in Variational Gaussian Process State Space Models

Talay M Cheema, 2021. (In Time Series Workshop at the 38th International Conference on Machine Learning).

We describe variational inference approaches in Gaussian process state space models in terms of local linearisations of the approximate posterior function. Most previous approaches have either assumed independence between the posterior dynamics and latent states (the mean-field (MF) approximation), or optimised free parameters for both, leading to limited scalability. We use our framework to prove that (i) there is a theoretical imperative to use non-MF approaches, to avoid excessive bias in the process noise hyperparameter estimate, and (ii) we can parameterise only the posterior dynamics without any loss of performance. Our approach suggests further approximations, based on the existing rich literature on filtering and smoothing for nonlinear systems, and unifies approaches for discrete and continuous time models.

#### Rethinking Attention with Performers

Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, David Benjamin Belanger, Lucy J Colwell, Adrian Weller, 2021. (In International Conference on Learning Representations).

We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. To approximate softmax attention-kernels, Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which may be of independent interest for scalable kernel methods. FAVOR+ can also be used to efficiently model kernelizable attention mechanisms beyond softmax. This representational power is crucial to accurately compare softmax with other kernels for the first time on large-scale tasks, beyond the reach of regular Transformers, and investigate optimal attention-kernels. Performers are linear architectures fully compatible with regular Transformers and with strong theoretical guarantees: unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence and low estimation variance. We tested Performers on a rich set of tasks stretching from pixel-prediction through text models to protein sequence modeling. We demonstrate competitive results with other examined efficient sparse and dense attention methods, showcasing effectiveness of the novel attention-learning paradigm leveraged by Performers.

#### A Psychology-Driven Computational Analysis of Political Interviews

Darren Cook, Miri Zilka, Simon Maskell, Laurence Alison, 2021. (Proc. Interspeech).

Can an interviewer influence the cooperativeness of an interviewee? The role of an interviewer in actualising a successful interview is an active field of social psychological research. A large-scale analysis of interviews, however, typically involves time-exorbitant manual tasks and considerable human effort. Despite recent advances in computational fields, many automated methods continue to rely on manually labelled training data to establish ground-truth. This reliance obscures explainability and hinders the mobility of analysis between applications. In this work, we introduce a cross-disciplinary approach to analysing interviewer efficacy. We suggest computational success measures as a transparent, automated, and reproducible alternative for pre-labelled data. We validate these measures with a small-scale study with human-responders. To study the interviewer’s influence on the interviewee we utilise features informed by social psychological theory to predict interview quality based on the interviewer’s linguistic behaviour. Our psychologically informed model significantly outperforms a bag-of-words model, demonstrating the strength of a cross-disciplinary approach toward the analysis of conversational data at scale.

#### Bayesian Deep Learning via Subnetwork Inference

Erik A. Daxberger, Eric T. Nalisnick, James Urquhart Allingham, Javier Antorán, José Miguel Hernández-Lobato, 2021. (In 38th International Conference on Machine Learning). Edited by Marina Meila, Tong Zhang. PMLR. Proceedings of Machine Learning Research.

The Bayesian paradigm has the potential to solve core issues of deep neural networks such as poor calibration and data inefficiency. Alas, scaling Bayesian inference to large weight spaces often requires restrictive approximations. In this work, we show that it suffices to perform inference over a small subset of model weights in order to obtain accurate predictive posteriors. The other weights are kept as point estimates. This subnetwork inference framework enables us to use expressive, otherwise intractable, posterior approximations over such subsets. In particular, we implement subnetwork linearized Laplace: We first obtain a MAP estimate of all weights and then infer a full-covariance Gaussian posterior over a subnetwork. We propose a subnetwork selection strategy that aims to maximally preserve the model’s predictive uncertainty. Empirically, our approach is effective compared to ensembles and less expressive posterior approximations over full networks.

#### Deep Neural Networks as Point Estimates for Deep Gaussian Processes

Vincent Dutordoir, James Hensman, Mark van der Wilk, Carl Henrik Ek, Zoubin Ghahramani, Nicolas Durrande, Dec 2021. (In Advances in Neural Information Processing Systems 34). Online.

Neural networks and Gaussian processes are complementary in their strengths and weaknesses. Having a better understanding of their relationship comes with the promise to make each method benefit from the strengths of the other. In this work, we establish an equivalence between the forward passes of neural networks and (deep) sparse Gaussian process models. The theory we develop is based on interpreting activation functions as interdomain inducing features through a rigorous analysis of the interplay between activation functions and kernels. This results in models that can either be seen as neural networks with improved uncertainty prediction or deep Gaussian processes with increased prediction accuracy. These claims are supported by experimental results on regression and classification datasets.

#### How Tight Can PAC-Bayes Be in the Small Data Regime?

Andrew Y. K. Foong, Wessel P. Bruinsma, David R. Burt, Richard E. Turner, 2021. (In Advances in Neural Information Processing Systems 34). Curran Associates, Inc.

In this paper, we investigate the question: Given a small number of datapoints, for example N = 30, how tight can PAC-Bayes and test set bounds be made? For such small datasets, test set bounds adversely affect generalisation performance by withholding data from the training procedure. In this setting, PAC-Bayes bounds are especially attractive, due to their ability to use all the data to simultaneously learn a posterior and bound its generalisation risk. We focus on the case of i.i.d. data with a bounded loss and consider the generic PAC-Bayes theorem of Germain et al. While their theorem is known to recover many existing PAC-Bayes bounds, it is unclear what the tightest bound derivable from their framework is. For a fixed learning algorithm and dataset, we show that the tightest possible bound coincides with a bound considered by Catoni; and, in the more natural case of distributions over datasets, we establish a lower bound on the best bound achievable in expectation. Interestingly, this lower bound recovers the Chernoff test set bound if the posterior is equal to the prior. Moreover, to illustrate how tight these bounds can be, we study synthetic one-dimensional classification tasks in which it is feasible to meta-learn both the prior and the form of the bound to numerically optimise for the tightest bounds possible. We find that in this simple, controlled scenario, PAC-Bayes bounds are competitive with comparable, commonly used Chernoff test set bounds. However, the sharpest test set bounds still lead to better guarantees on the generalisation error than the PAC-Bayes bounds we consider.

#### Independent mechanisms analysis, a new concept?

L. Gresele*, J. von Kügelgen*, V. Stimper, B. Schölkopf, M. Besserve, 2021. (In Advances in Neural Information Processing Systems 34). Edited by M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, J. Wortman Vaughan. Curran Associates, Inc. **Note**: *equal contribution.

Independent component analysis provides a principled framework for unsupervised representation learning, with solid theory on the identifiability of the latent code that generated the data, given only observations of mixtures thereof. Unfortunately, when the mixing is nonlinear, the model is provably nonidentifiable, since statistical independence alone does not sufficiently constrain the problem. Identifiability can be recovered in settings where additional, typically observed variables are included in the generative process. We investigate an alternative path and consider instead including assumptions reflecting the principle of independent causal mechanisms exploited in the field of causality. Specifically, our approach is motivated by thinking of each source as independently influencing the mixing process. This gives rise to a framework which we term independent mechanism analysis. We provide theoretical and empirical evidence that our approach circumvents a number of nonidentifiability issues arising in nonlinear blind source separation.

#### On component interactions in two-stage recommender systems

Jiri Hron, Karl Krauth, Michael I. Jordan, Niki Kilbertus, 2021. (NeurIPS).

Thanks to their scalability, two-stage recommenders are used by many of today’s largest online platforms, including YouTube, LinkedIn, and Pinterest. These systems produce recommendations in two steps: (i) multiple nominators, tuned for low prediction latency, preselect a small subset of candidates from the whole item pool; (ii) a slower but more accurate ranker further narrows down the nominated items, and serves to the user. Despite their popularity, the literature on two-stage recommenders is relatively scarce, and the algorithms are often treated as mere sums of their parts. Such treatment presupposes that the two-stage performance is explained by the behavior of the individual components in isolation. This is not the case: using synthetic and real-world data, we demonstrate that interactions between the ranker and the nominators substantially affect the overall performance. Motivated by these findings, we derive a generalization lower bound which shows that independent nominator training can lead to performance on par with uniformly random recommendations. We find that careful design of item pools, each assigned to a different nominator, alleviates these issues. As manual search for a good pool allocation is difficult, we propose to learn one instead using a Mixture-of-Experts based approach. This significantly improves both precision and recall at K.

#### Scalable Gaussian Process Variational Autoencoders

Metod Jazbec, Matt Ashman, Vincent Fortuin, Michael Pearce, Stephan Mandt, Gunnar Rätsch, 13–15 Apr 2021. (In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics). Edited by Arindam Banerjee, Kenji Fukumizu. PMLR. Proceedings of Machine Learning Research.

Conventional variational autoencoders fail in modeling correlations between data points due to their use of factorized priors. Amortized Gaussian process inference through GP-VAEs has led to significant improvements in this regard, but is still inhibited by the intrinsic complexity of exact GP inference. We improve the scalability of these methods through principled sparse inference approaches. We propose a new scalable GP-VAE model that outperforms existing approaches in terms of runtime and memory footprint, is easy to implement, and allows for joint end-to-end optimization of all components.

#### Causal Direction of Data Collection Matters: Implications of Causal and Anticausal Learning for NLP

Z. Jin*, J. von Kügelgen*, J. Ni, T. Vaidhya, A. Kaushal, M. Sachan, B. Schölkopf, 2021. (In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP)). Edited by Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih. Association for Computational Linguistics. **DOI**: 10.18653/v1/2021.emnlp-main.748. **Note**: *equal contribution.

The principle of independent causal mechanisms (ICM) states that generative processes of real world data consist of independent modules which do not influence or inform each other. While this idea has led to fruitful developments in the field of causal inference, it is not widely-known in the NLP community. In this work, we argue that the causal direction of the data collection process bears nontrivial implications that can explain a number of published NLP findings, such as differences in semi-supervised learning (SSL) and domain adaptation (DA) performance across different settings. We categorize common NLP tasks according to their causal direction and empirically assay the validity of the ICM principle for text data using minimum description length. We conduct an extensive meta-analysis of over 100 published SSL and 30 DA studies, and find that the results are consistent with our expectations based on causal insights. This work presents the first attempt to analyze the ICM principle in NLP, and provides constructive suggestions for future modeling choices.

#### Self-supervised learning with data augmentations provably isolates content from style

J. von Kügelgen*, Y. Sharma*, L. Gresele*, W. Brendel, B. Schölkopf, M. Besserve, F. Locatello, 2021. (In Advances in Neural Information Processing Systems 34). Edited by M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, J. Wortman Vaughan. Curran Associates, Inc. **Note**: *equal contribution.

Self-supervised representation learning has shown remarkable success in a number of domains. A common practice is to perform data augmentation via hand-crafted transformations intended to leave the semantics of the data invariant. We seek to understand the empirical success of this approach from a theoretical perspective. We formulate the augmentation process as a latent variable model by postulating a partition of the latent representation into a content component, which is assumed invariant to augmentation, and a style component, which is allowed to change. Unlike prior work on disentanglement and independent component analysis, we allow for both nontrivial statistical and causal dependencies in the latent space. We study the identifiability of the latent representation based on pairs of views of the observations and prove sufficient conditions that allow us to identify the invariant content partition up to an invertible mapping in both generative and discriminative settings. We find numerical simulations with dependent latent variables are consistent with our theory. Lastly, we introduce Causal3DIdent, a dataset of high-dimensional, visually complex images with rich causal dependencies, which we use to study the effect of data augmentations performed in practice.

#### PolyViT: Co-training Vision Transformers on Images, Videos and Audio

Valerii Likhosherstov, Anurag Arnab, Krzysztof Choromanski, Mario Lucic, Yi Tay, Adrian Weller, Mostafa Dehghani, 2021. (CoRR).

Can we train a single transformer model capable of processing multiple modalities and datasets, whilst sharing almost all of its learnable parameters? We present PolyViT, a model trained on image, audio and video which answers this question. By co-training different tasks on a single modality, we are able to improve the accuracy of each individual task and achieve state-of-the-art results on 5 standard video- and audio-classification datasets. Co-training PolyViT on multiple modalities and tasks leads to a model that is even more parameter-efficient, and learns representations that generalize across multiple domains. Moreover, we show that co-training is simple and practical to implement, as we do not need to tune hyperparameters for each combination of datasets, but can simply adapt those from standard, single-task training.

#### Sub-Linear Memory: How to Make Performers SLiM

Valerii Likhosherstov, Krzysztof M Choromanski, Jared Quincy Davis, Xingyou Song, Adrian Weller, 2021. (In Advances in Neural Information Processing Systems 34). Edited by M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, J. Wortman Vaughan. Curran Associates, Inc.

The Transformer architecture has revolutionized deep learning on sequential data, becoming ubiquitous in state-of-the-art solutions for a wide variety of applications. Yet vanilla Transformers are notoriously resource-expensive, requiring O(L^2) in serial time and memory as functions of input length L. Recent works proposed various linear self-attention mechanisms, scaling only as O(L) for serial computation. We perform a thorough analysis of recent Transformer mechanisms with linear self-attention, Performers, in terms of overall computational complexity. We observe a remarkable computational flexibility: forward and backward propagation can be performed with no approximations using sublinear memory as a function of L (in addition to negligible storage for the input sequence), at a cost of greater time complexity in the parallel setting. In the extreme case, a Performer consumes only O(1) memory during training, and still requires O(L) time. This discovered time-memory tradeoff can be used for training or, due to complete backward-compatibility, for fine-tuning on a low-memory device, e.g. a smartphone or an earlier-generation GPU, thus contributing towards decentralized and democratized deep learning.

#### CWY Parametrization: a Solution for Parallelized Optimization of Orthogonal and Stiefel Matrices

Valerii Likhosherstov, Jared Davis, Krzysztof Choromanski, Adrian Weller, 13–15 Apr 2021. (In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics). Edited by Arindam Banerjee, Kenji Fukumizu. PMLR. Proceedings of Machine Learning Research.

We introduce an efficient approach for optimization over orthogonal groups on highly parallel computation units such as GPUs or TPUs. As in earlier work, we parametrize an orthogonal matrix as a product of Householder reflections. However, to overcome low parallelization capabilities of computing Householder reflections sequentially, we propose employing an accumulation scheme called the compact WY (or CWY) transform – a compact parallelization-friendly matrix representation for the series of Householder reflections. We further develop a novel Truncated CWY (or T-CWY) approach for Stiefel manifold parametrization which has a competitive complexity and, again, yields benefits when computed on GPUs and TPUs. We prove that our CWY and T-CWY methods lead to convergence to a stationary point of the training objective when coupled with stochastic gradient descent. We apply our methods to train recurrent neural network architectures in the tasks of neural machine translation and video prediction.

#### Debiasing a First-order Heuristic for Approximate Bi-level Optimization

Valerii Likhosherstov, Xingyou Song, Krzysztof Choromanski, Jared Q Davis, Adrian Weller, 18–24 Jul 2021. (In Proceedings of the 38th International Conference on Machine Learning). Edited by Marina Meila, Tong Zhang. PMLR. Proceedings of Machine Learning Research.

Approximate bi-level optimization (ABLO) consists of (outer-level) optimization problems, involving numerical (inner-level) optimization loops. While ABLO has many applications across deep learning, it suffers from time and memory complexity proportional to the length r of its inner optimization loop. To address this complexity, an earlier first-order method (FOM) was proposed as a heuristic which omits second derivative terms, yielding significant speed gains and requiring only constant memory. Despite FOM’s popularity, there is a lack of theoretical understanding of its convergence properties. We contribute by theoretically characterizing FOM’s gradient bias under mild assumptions. We further demonstrate a rich family of examples where FOM-based SGD does not converge to a stationary point of the ABLO objective. We address this concern by proposing an unbiased FOM (UFOM) enjoying constant memory complexity as a function of r. We characterize the introduced time-variance tradeoff, demonstrate convergence bounds, and find an optimal UFOM for a given ABLO problem. Finally, we propose an efficient adaptive UFOM scheme.

#### Iterative Amortized Policy Optimization

Joseph Marino, Alexandre Piche, Alessandro Davide Ialongo, Yisong Yue, 2021. (In Advances in Neural Information Processing Systems 34). Edited by M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, J. Wortman Vaughan. Curran Associates, Inc.

Policy networks are a central feature of deep reinforcement learning (RL) algorithms for continuous control, enabling the estimation and sampling of high-value actions. From the variational inference perspective on RL, policy networks, when used with entropy or KL regularization, are a form of amortized optimization, optimizing network parameters rather than the policy distributions directly. However, direct amortized mappings can yield suboptimal policy estimates and restricted distributions, limiting performance and exploration. Given this perspective, we consider the more flexible class of iterative amortized optimizers. We demonstrate that the resulting technique, iterative amortized policy optimization, yields performance improvements over direct amortization on benchmark continuous control tasks.

#### ORBIT: A real-world few-shot dataset for teachable object recognition

Daniela Massiceti, Luisa Zintgraf, John Bronskill, Lida Theodorou, Matthew Tobias Harris, Edward Cutrell, Cecily Morrison, Katja Hofmann, Simone Stumpf, 2021. (In Proceedings of the IEEE/CVF International Conference on Computer Vision).

Object recognition has made great advances in the last decade, but predominantly still relies on many high-quality training examples per object category. In contrast, learning new objects from only a few examples could enable many impactful applications from robotics to user personalization. Most few-shot learning research, however, has been driven by benchmark datasets that lack the high variation that these applications will face when deployed in the real world. To close this gap, we present the ORBIT dataset and benchmark, grounded in the real-world application of teachable object recognizers for people who are blind/low-vision. The dataset contains 3,822 videos of 486 objects recorded by people who are blind/low-vision on their mobile phones. The benchmark reflects a realistic, highly challenging recognition problem, providing a rich playground to drive research in robustness to few-shot, high-variation conditions. We set the benchmark’s first state-of-the-art and show there is massive scope for further innovation, holding the potential to impact a broad range of real-world vision applications including tools for the blind/low-vision community.

#### Addressing Bias in Active Learning with Depth Uncertainty Networks… or Not

Chelsea Murray, James Urquhart Allingham, Javier Antorán, José Miguel Hernández-Lobato, 2021. (In I (Still) Can't Believe It's Not Better! Workshop at NeurIPS 2021, Virtual Workshop, December 13, 2021). Edited by Melanie F. Pradier, Aaron Schein, Stephanie L. Hyland, Francisco J. R. Ruiz, Jessica Zosa Forde. PMLR. Proceedings of Machine Learning Research.

Farquhar et al. 2021 show that correcting for active learning bias with underparameterised models leads to improved downstream performance. For overparameterised models such as NNs, however, correction leads either to decreased or unchanged performance. They suggest that this is due to an “overfitting bias” which offsets the active learning bias. We show that depth uncertainty networks operate in a low overfitting regime, much like underparameterised models. They should therefore see an increase in performance with bias correction. Surprisingly, they do not. We propose that this negative result, as well as the results of Farquhar et al. 2021, can be explained via the lens of the bias-variance decomposition of generalisation error.

#### Global inducing point variational posteriors for Bayesian neural networks and deep Gaussian processes

Sebastian W. Ober, Laurence Aitchison, 2021. (In 38th International Conference on Machine Learning).

We consider the optimal approximate posterior over the top-layer weights in a Bayesian neural network for regression, and show that it exhibits strong dependencies on the lower-layer weights. We adapt this result to develop a correlated approximate posterior over the weights at all layers in a Bayesian neural network. We extend this approach to deep Gaussian processes, unifying inference in the two model classes. Our approximate posterior uses learned “global” inducing points, which are defined only at the input layer and propagated through the network to obtain inducing inputs at subsequent layers. By contrast, standard “local” inducing point methods from the deep Gaussian process literature optimise a separate set of inducing inputs at every layer, and thus do not model correlations across layers. Our method gives state-of-the-art performance for a variational Bayesian method, without data augmentation or tempering, on CIFAR-10 of 86.7%, which is comparable to SGMCMC without tempering but with data augmentation (88% in Wenzel et al. 2020).

#### A variational approximate posterior for the deep Wishart process

Sebastian W. Ober, Laurence Aitchison, 2021. (In Advances in Neural Information Processing Systems 34).

Recent work introduced deep kernel processes as an entirely kernel-based alternative to NNs (Aitchison et al. 2020). Deep kernel processes flexibly learn good top-layer representations by alternately sampling the kernel from a distribution over positive semi-definite matrices and performing nonlinear transformations. A particular deep kernel process, the deep Wishart process (DWP), is of particular interest because its prior can be made equivalent to deep Gaussian process (DGP) priors for kernels that can be expressed entirely in terms of Gram matrices. However, inference in DWPs has not yet been possible due to the lack of sufficiently flexible distributions over positive semi-definite matrices. Here, we give a novel approach to obtaining flexible distributions over positive semi-definite matrices by generalising the Bartlett decomposition of the Wishart probability density. We use this new distribution to develop an approximate posterior for the DWP that includes dependency across layers. We develop a doubly-stochastic inducing-point inference scheme for the DWP and show experimentally that inference in the DWP can improve performance over doing inference in a DGP with the equivalent prior.

#### The promises and pitfalls of deep kernel learning

Sebastian W. Ober, Carl Edward Rasmussen, Mark van der Wilk, 2021. (In 37th Conference on Uncertainty in Artificial Intelligence).

Deep kernel learning (DKL) and related techniques aim to combine the representational power of neural networks with the reliable uncertainty estimates of Gaussian processes. One crucial aspect of these models is an expectation that, because they are treated as Gaussian process models optimized using the marginal likelihood, they are protected from overfitting. However, we identify situations where this is not the case. We explore this behavior, explain its origins and consider how it applies to real datasets. Through careful experimentation on the UCI, CIFAR-10, and the UTKFace datasets, we find that the overfitting from overparameterized maximum marginal likelihood, in which the model is “somewhat Bayesian”, can in certain scenarios be worse than that from not being Bayesian at all. We explain how and when DKL can still be successful by investigating optimization dynamics. We also find that failures of DKL can be rectified by a fully Bayesian treatment, which leads to the desired performance improvements over standard neural networks and Gaussian processes.

#### Attacking Few-Shot Classifiers with Adversarial Support Poisoning

Elre T. Oldewage, John Bronskill, Richard E. Turner, 2021. (In A Blessing in Disguise: The Prospects and Perils of Adversarial Machine Learning, Workshop at ICML 2021).

This paper examines the robustness of deployed few-shot meta-learning systems when they are fed an imperceptibly perturbed few-shot dataset, showing that the resulting predictions on test inputs can become worse than chance. This is achieved by developing a novel attack, Adversarial Support Poisoning or ASP, which crafts a poisoned set of examples. When even a small subset of malicious data points is inserted into the support set of a meta-learner, accuracy is significantly reduced. We evaluate the new attack on a variety of few-shot classification algorithms and scenarios, and propose a form of adversarial training that significantly improves robustness against both poisoning and evasion attacks.

#### An Algorithmic Framework for Positive Action

Oliver Thomas, Miri Zilka, Adrian Weller, Novi Quadrianto, 2021. (Equity and Access in Algorithms, Mechanisms, and Optimization).

Positive action is defined within anti-discrimination legislation as voluntary, legal action taken to address an imbalance of opportunity affecting individuals belonging to under-represented groups. Within this theme, we propose a novel algorithmic fairness framework to advance equal representation while respecting anti-discrimination legislation and equal-treatment rights. We use a counterfactual fairness approach to assign one of three outcomes to each candidate: accept; reject; or flagged as a positive action candidate.

#### Ensembling geophysical models with Bayesian Neural Networks

Ushnish Sengupta, Matt Amos, J. Scott Hosking, Carl Edward Rasmussen, Matthew P. Juniper, Paul J. Young, 2021. (In Advances in Neural Information Processing Systems 34).

Ensembles of geophysical models improve projection accuracy and express uncertainties. We develop a novel data-driven ensembling strategy for combining geophysical models using Bayesian Neural Networks, which infers spatiotemporally varying model weights and bias while accounting for heteroscedastic uncertainties in the observations. This produces more accurate and uncertainty-aware projections without sacrificing interpretability. Applied to the prediction of total column ozone from an ensemble of 15 chemistry-climate models, we find that the Bayesian neural network ensemble (BayNNE) outperforms existing ensembling methods, achieving a 49.4% reduction in RMSE for temporal extrapolation, and a 67.4% reduction in RMSE for polar data voids, compared to a weighted mean. Uncertainty is also well-characterized, with 90.6% of the data points in our extrapolation validation dataset lying within 2 standard deviations and 98.5% within 3 standard deviations.

#### Kernel Identification Through Transformers

Fergus Simpson, Ian Davies, Vidhi Lalchand, Alessandro Vullo, Nicolas Durrande, Carl Edward Rasmussen, 2021. (In Advances in Neural Information Processing Systems 34).

Kernel selection plays a central role in determining the performance of Gaussian Process (GP) models, as the chosen kernel determines both the inductive biases and prior support of functions under the GP prior. This work addresses the challenge of constructing custom kernel functions for high-dimensional GP regression models. Drawing inspiration from recent progress in deep learning, we introduce a novel approach named KITT: Kernel Identification Through Transformers. KITT exploits a transformer-based architecture to generate kernel recommendations in under 0.1 seconds, which is several orders of magnitude faster than conventional kernel search algorithms. We train our model using synthetic data generated from priors over a vocabulary of known kernels. By exploiting the nature of the self-attention mechanism, KITT is able to process datasets with inputs of arbitrary dimension. We demonstrate that kernels chosen by KITT yield strong performance over a diverse collection of regression benchmarks.

#### Marginalised Gaussian Processes with Nested Sampling

Fergus Simpson, Vidhi Lalchand, Carl Edward Rasmussen, 2021. (In Advances in Neural Information Processing Systems 34). Curran Associates, Inc.

Gaussian Process models are a rich distribution over functions with inductive biases controlled by a kernel function. Learning occurs through optimisation of the kernel hyperparameters using the marginal likelihood as the objective. This work proposes nested sampling as a means of marginalising kernel hyperparameters, because it is a technique that is well-suited to exploring complex, multi-modal distributions. We benchmark against Hamiltonian Monte Carlo on time-series and two-dimensional regression tasks, finding that a principled approach to quantifying hyperparameter uncertainty substantially improves the quality of prediction intervals.

#### FS-Mol: A Few-Shot Learning Dataset of Molecules

Megan Stanley, John Bronskill, Krzysztof Maziarz, Hubert Misztela, Jessica Lanini, Marwin Segler, Nadine Schneider, Marc Brockschmidt, 2021. (In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)).

Small datasets are ubiquitous in drug discovery as data generation is expensive and can be restricted for ethical reasons (e.g. in vivo experiments). A widely applied technique in early drug discovery to identify novel active molecules against a protein target is modelling quantitative structure-activity relationships (QSAR). It is known to be extremely challenging, as available measurements of compound activities range in the low dozens or hundreds. However, many such related datasets exist, each with a small number of datapoints, opening up the opportunity for few-shot learning after pre-training on a substantially larger corpus of data. At the same time, many few-shot learning methods are currently evaluated in the computer-vision domain. We propose that expansion into a new application, as well as the possibility to use explicitly graph-structured data, will drive exciting progress in few-shot learning. Here, we provide a few-shot learning dataset (FS-Mol) and complementary benchmarking procedure. We define a set of tasks on which few-shot learning methods can be evaluated, with a separate set of tasks for use in pre-training. In addition, we implement and evaluate a number of existing single-task, multi-task, and meta-learning approaches as baselines for the community. We hope that our dataset, support code release, and baselines will encourage future work on this extremely challenging new domain for few-shot learning.

#### Unsupervised Object Learning via Common Fate

M. Tangemann, S. Schneider, J. von Kügelgen, F. Locatello, P. Gehler, T. Brox, M. Kümmerer, M. Bethge, B. Schölkopf, 2021.

Learning generative object models from unlabelled videos is a long-standing problem and is required for causal scene modeling. We decompose this problem into three easier subtasks, and provide candidate solutions for each of them. Inspired by the Common Fate Principle of Gestalt Psychology, we first extract (noisy) masks of moving objects via unsupervised motion segmentation. Second, generative models are trained on the masks of the background and the moving objects, respectively. Third, background and foreground models are combined in a conditional “dead leaves” scene model to sample novel scene configurations where occlusions and depth layering arise naturally. To evaluate the individual stages, we introduce the Fishbowl dataset positioned between complex real-world scenes and common object-centric benchmarks of simplistic objects. We show that our approach allows learning generative models that generalize beyond the occlusions present in the input videos, and represent scenes in a modular fashion that allows sampling plausible scenes outside the training distribution by permitting, for instance, object numbers or densities not observed in the training set.

#### Combining pseudo-point and state space approximations for sum-separable Gaussian Processes

Will Tebbutt, Arno Solin, Richard E. Turner, 2021. (In Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence). Edited by Cassio de Campos, Marloes H. Maathuis. PMLR. Proceedings of Machine Learning Research.

Gaussian processes (GPs) are important probabilistic tools for inference and learning in spatio-temporal modelling problems such as those in climate science and epidemiology. However, existing GP approximations do not simultaneously support large numbers of off-the-grid spatial data-points and long time-series, which is a hallmark of many applications. Pseudo-point approximations, one of the gold-standard methods for scaling GPs to large data sets, are well suited for handling off-the-grid spatial data. However, they cannot handle long temporal observation horizons, effectively reverting to cubic computational scaling in the time dimension. State space GP approximations are well suited to handling temporal data, if the temporal GP prior admits a Markov form, leading to linear complexity in the number of temporal observations, but have a cubic spatial cost and cannot handle off-the-grid spatial data. In this work we show that there is a simple and elegant way to combine pseudo-point methods with the state space GP approximation framework to get the best of both worlds. The approach hinges on a surprising conditional independence property which applies to space–time separable GPs. We demonstrate empirically that the combined approach is more scalable and applicable to a greater range of spatio-temporal problems than either method on its own.

#### Backward-Compatible Prediction Updates: A Probabilistic Approach

F. Träuble, J. von Kügelgen, M. Kleindessner, F. Locatello, B. Schölkopf, P. Gehler, 2021. (In Advances in Neural Information Processing Systems 34). Edited by M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, J. Wortman Vaughan. Curran Associates, Inc.

When machine learning systems meet real world applications, accuracy is only one of several requirements. In this paper, we assay a complementary perspective originating from the increasing availability of pre-trained and regularly improving state-of-the-art models. While new improved models develop at a fast pace, downstream tasks vary more slowly or stay constant. Assume that we have a large unlabelled data set for which we want to maintain accurate predictions. Whenever a new and presumably better ML model becomes available, we encounter two problems: (i) given a limited budget, which data points should be re-evaluated using the new model?; and (ii) if the new predictions differ from the current ones, should we update? Problem (i) is about compute cost, which matters for very large data sets and models. Problem (ii) is about maintaining consistency of the predictions, which can be highly relevant for downstream applications; our demand is to avoid negative flips, i.e., changing correct to incorrect predictions. In this paper, we formalize the Prediction Update Problem and present an efficient probabilistic approach as an answer to the above questions. In extensive experiments on standard classification benchmark data sets, we show that our method outperforms alternative strategies along key metrics for backward-compatible prediction updates.

#### Iterative teaching by label synthesis

Weiyang Liu, Zhen Liu, Hanchen Wang, Liam Paull, Bernhard Schölkopf, Adrian Weller, 2021. (Advances in Neural Information Processing Systems).

#### Learning with hyperspherical uniformity

Weiyang Liu, Rongmei Lin, Zhen Liu, Li Xiong, Bernhard Schölkopf, Adrian Weller, 2021. (In International Conference On Artificial Intelligence and Statistics).

#### Orthogonal over-parameterized training

Weiyang Liu, Rongmei Lin, Zhen Liu, James M Rehg, Liam Paull, Li Xiong, Le Song, Adrian Weller, 2021. (In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition).

#### Self-supervised 3D face reconstruction via conditional estimation

Yandong Wen, Weiyang Liu, Bhiksha Raj, Rita Singh, 2021. (In Proceedings of the IEEE/CVF International Conference on Computer Vision).

#### Couplings for Multinomial Hamiltonian Monte Carlo

Kai Xu, Tor Erlend Fjelde, Charles Sutton, Hong Ge, 13–15 Apr 2021. (In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics). Edited by Arindam Banerjee, Kenji Fukumizu. PMLR. Proceedings of Machine Learning Research.

Hamiltonian Monte Carlo (HMC) is a popular sampling method in Bayesian inference. Recently, Heng & Jacob (2019) studied Metropolis HMC with couplings for unbiased Monte Carlo estimation, establishing a generic parallelizable scheme for HMC. However, in practice a different HMC method, multinomial HMC, is considered as the go-to method, e.g. as part of the no-U-turn sampler. In multinomial HMC, proposed states are not limited to end-points as in Metropolis HMC; instead points along the entire trajectory can be proposed. In this paper, we establish couplings for multinomial HMC, based on optimal transport for multinomial sampling in its transition. We prove an upper bound for the meeting time – the time it takes for the coupled chains to meet – based on the notion of local contractivity. We evaluate our methods using three targets: 1,000-dimensional Gaussians, logistic regression and log-Gaussian Cox point processes. Compared to Heng & Jacob (2019), coupled multinomial HMC generally attains a smaller meeting time, and is more robust to choices of step sizes and trajectory lengths, which allows re-use of existing adaptation methods for HMC. These improvements together pave the way for a wider and more practical use of coupled HMC methods.
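
The idea of coupling a single multinomial (categorical) draw, which the abstract above relies on, can be illustrated in a few lines. The sketch below is a standard maximal coupling of two categorical distributions and is not necessarily the optimal-transport coupling constructed in the paper; the function names `sample_categorical` and `maximal_coupling` are hypothetical and chosen only for illustration.

```python
import random


def sample_categorical(weights, rng):
    """Draw an index with probability proportional to non-negative weights."""
    threshold = rng.random() * sum(weights)
    cumulative = 0.0
    last_positive = len(weights) - 1
    for index, weight in enumerate(weights):
        if weight > 0.0:
            last_positive = index
        cumulative += weight
        if threshold < cumulative:
            return index
    return last_positive  # only reached through floating-point round-off


def maximal_coupling(p, q, rng):
    """Return (i, j) with i ~ p and j ~ q, maximising the probability that i == j.

    Assumes p and q are probability vectors of equal length.
    """
    overlap = [min(a, b) for a, b in zip(p, q)]
    if rng.random() < sum(overlap):
        # With probability sum_i min(p_i, q_i) both chains pick the same index.
        i = sample_categorical(overlap, rng)
        return i, i
    # Otherwise each chain draws from its own residual mass; the two residuals
    # have disjoint supports, so the draws differ.
    i = sample_categorical([a - o for a, o in zip(p, overlap)], rng)
    j = sample_categorical([b - o for b, o in zip(q, overlap)], rng)
    return i, j


rng = random.Random(0)
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
pairs = [maximal_coupling(p, q, rng) for _ in range(10_000)]
meet_rate = sum(i == j for i, j in pairs) / len(pairs)
print(f"empirical meeting rate: {meet_rate:.3f}")
```

For these two example distributions the meeting probability of a maximal coupling is the overlap mass, 0.2 + 0.3 + 0.2 = 0.7, so the printed rate should be close to that value.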

#### Locality sensitive teaching

Zhaozhuo Xu, Beidi Chen, Chaojian Li, Weiyang Liu, Le Song, Yingyan Lin, Anshumali Shrivastava, 2021. (Advances in Neural Information Processing Systems).

## 2020

#### Depth Uncertainty in Neural Networks

Javier Antorán, James Urquhart Allingham, José Miguel Hernández-Lobato, 2020. (In Advances in Neural Information Processing Systems 33). Edited by Hugo Larochelle, Marc'Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, Hsuan-Tien Lin.

Existing methods for estimating uncertainty in deep learning tend to require multiple forward passes, making them unsuitable for applications where computational resources are limited. To solve this, we perform probabilistic reasoning over the depth of neural networks. Different depths correspond to subnetworks which share weights and whose predictions are combined via marginalisation, yielding model uncertainty. By exploiting the sequential structure of feed-forward networks, we are able to both evaluate our training objective and make predictions with a single forward pass. We validate our approach on real-world regression and image classification tasks. Our approach provides uncertainty calibration, robustness to dataset shift, and accuracies competitive with more computationally expensive baselines.

**Comment:** Code

#### Sparse Gaussian process variational autoencoders

Matthew Ashman, Jonny So, Will Tebbutt, Vincent Fortuin, Michael Pearce, Richard E. Turner, 2020.

Large, multi-dimensional spatio-temporal datasets are omnipresent in modern science and engineering. An effective framework for handling such data is Gaussian process deep generative models (GP-DGMs), which employ GP priors over the latent variables of DGMs. Existing approaches for performing inference in GP-DGMs do not support sparse GP approximations based on inducing points, which are essential for the computational efficiency of GPs, nor do they handle missing data – a natural occurrence in many spatio-temporal datasets – in a principled manner. We address these shortcomings with the development of the sparse Gaussian process variational autoencoder (SGP-VAE), characterised by the use of partial inference networks for parameterising sparse GP approximations. Leveraging the benefits of amortised variational inference, the SGP-VAE enables inference in multi-output sparse GPs on previously unobserved data with no additional training. The SGP-VAE is evaluated in a variety of experiments where it outperforms alternative approaches including multi-output GPs and structured VAEs.

#### Converting to Optimization in Machine Learning: Perturb-and-MAP, Differential Privacy, and Program Synthesis

Matej Balog, 2020. University of Cambridge, Department of Engineering, Cambridge, UK.

On a mathematical level, most computational problems encountered in machine learning are instances of one of four abstract, fundamental problems: sampling, integration, optimization, and search. Thanks to the rich history of the respective mathematical fields, disparate methods with different properties have been developed for these four problem classes. As a result it can be beneficial to convert a problem from one abstract class into a problem of a different class, because the latter might come with insights, techniques, and algorithms well suited to the particular problem at hand. In particular, this thesis contributes four new methods and generalizations of existing methods for converting specific non-optimization machine learning tasks into optimization problems with more appealing properties. The first example is partition function estimation (an integration problem), where an existing algorithm – the Gumbel trick – for converting to the MAP optimization problem is generalized into a more general family of algorithms, such that other instances of this family have better statistical properties. Second, this family of algorithms is further generalized to another integration problem, the problem of estimating Rényi entropies. The third example shows how an intractable sampling problem arising when wishing to publicly release a database containing sensitive data in a safe (“differentially private”) manner can be converted into an optimization problem using the theory of Reproducing Kernel Hilbert Spaces. Finally, the fourth case study casts the challenging discrete search problem of program synthesis from input-output examples as a supervised learning task that can be efficiently tackled using gradient-based optimization. In all four instances, the conversions result in novel algorithms with desirable properties. 
In the first instance, new generalizations of the Gumbel trick can be used to construct statistical estimators of the partition function that achieve the same estimation error while using up to 40% fewer samples. The second instance shows that unbiased estimators of the Rényi entropy can be constructed in the Perturb-and-MAP framework. The main contribution of the third instance is theoretical: the conversion shows that it is possible to construct an algorithm for releasing synthetic databases that approximate databases containing sensitive data in a mathematically precise sense, and to prove results about their approximation errors. Finally, the fourth conversion yields an algorithm for synthesising program source code from input-output examples that is able to solve test problems 1-3 orders of magnitude faster than a wide range of baselines.
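
The basic Gumbel trick mentioned in the abstract above can be sketched on a small discrete model. This is a minimal illustration of the classical trick only, not of the thesis's generalised family of estimators; the function names and the example potentials are hypothetical. It relies on the standard fact that if each potential θ_i is perturbed with independent standard Gumbel noise, the maximum is Gumbel-distributed with location log Z, so its mean is log Z plus the Euler-Mascheroni constant.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329


def logsumexp(potentials):
    """Exact log partition function, log Z = log sum_i exp(theta_i)."""
    m = max(potentials)
    return m + math.log(sum(math.exp(t - m) for t in potentials))


def sample_gumbel(rng):
    """Standard Gumbel sample via the inverse CDF, G = -log(-log U)."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))


def gumbel_log_partition(potentials, num_samples, rng):
    """Estimate log Z by averaging the maxima of Gumbel-perturbed potentials."""
    total = 0.0
    for _ in range(num_samples):
        # Each inner max is a MAP problem on the perturbed model.
        total += max(t + sample_gumbel(rng) for t in potentials)
    # The perturbed maximum has mean log Z + gamma, so subtract gamma.
    return total / num_samples - EULER_GAMMA


rng = random.Random(0)
theta = [0.5, -1.0, 2.0, 0.0]
print("exact log Z   :", logsumexp(theta))
print("Gumbel estimate:", gumbel_log_partition(theta, 20_000, rng))
```

In this toy setting the maximisation is a trivial `max` over four states; the appeal of the trick is that the same estimator applies whenever a MAP solver is available for the perturbed model.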

#### Neural program synthesis with a differentiable fixer

Matej Balog, Rishabh Singh, Petros Maniatis, Charles Sutton, 2020. (arXiv).

We present a new program synthesis approach that combines an encoder-decoder based synthesis architecture with a differentiable program fixer. Our approach is inspired by the fact that human developers seldom get their program correct on the first attempt, and perform iterative testing-based program fixing to get to the desired program functionality. Similarly, our approach first learns a distribution over programs conditioned on an encoding of a set of input-output examples, and then iteratively performs fix operations using the differentiable fixer. The fixer takes as input the original examples and the current program’s outputs on example inputs, and generates a new distribution over the programs with the goal of reducing the discrepancies between the current program outputs and the desired example outputs. We train our architecture end-to-end on the RobustFill domain, and show that the addition of the fixer module leads to a significant improvement on synthesis accuracy compared to using beam search.

#### Evaluating and Aggregating Feature-based Model Explanations

Umang Bhatt, Adrian Weller, Jose M. F. Moura, 2020. (In International Joint Conference on Artificial Intelligence).

A feature-based model explanation denotes how much each input feature contributes to a model’s output for a given data point. As the number of proposed explanation functions grows, we lack quantitative evaluation criteria to help practitioners know when to use which explanation function. This paper proposes quantitative evaluation criteria for feature-based explanations: low sensitivity, high faithfulness, and low complexity. We devise a framework for aggregating explanation functions. We develop a procedure for learning an aggregate explanation function with lower complexity and then derive a new aggregate Shapley value explanation function that minimizes sensitivity.

#### Explainable Machine Learning in Deployment

Umang Bhatt, Alice Xiang, Shubham Sharma, Adrian Weller, Ankur Taly, Yunhan Jia, Joydeep Ghosh, Ruchir Puri, José M. F. Moura, Peter Eckersley, 2020. (In ACM Conference on Fairness, Accountability, and Transparency (FAT*)).

Explainable machine learning offers the potential to provide stakeholders with insights into model behavior by using various methods such as feature importance scores, counterfactual explanations, or influential training data. Yet there is little understanding of how organizations use these methods in practice. This study explores how organizations view and use explainability for stakeholder consumption. We find that, currently, the majority of deployments are not for end users affected by the model but rather for machine learning engineers, who use explainability to debug the model itself. There is thus a gap between explainability in practice and the goal of transparency, since explanations primarily serve internal stakeholders rather than external ones. Our study synthesizes the limitations of current explainability techniques that hamper their use for end users. To facilitate end user interaction, we develop a framework for establishing clear goals for explainability. We end by discussing concerns raised regarding explainability.

#### Data and computation efficient meta-learning

John Bronskill, November 2020. University of Cambridge, Cambridge, UK.

In order to make predictions with high accuracy, conventional deep learning systems require large training datasets consisting of thousands or millions of examples and long training times measured in hours or days, consuming high levels of electricity with a negative impact on our environment. It is desirable to have machine learning systems that can emulate human behavior such that they can quickly learn new concepts from only a few examples. This is especially true if we need to quickly customize or personalize machine learning models to specific scenarios where it would be impractical to acquire a large amount of training data and where a mobile device is the means for computation. We define a data efficient machine learning system to be one that can learn a new concept from only a few examples (or shots) and a computation efficient machine learning system to be one that can learn a new concept rapidly without retraining on an everyday computing device such as a smart phone. In this work, we design, develop, analyze, and extend the theory of machine learning systems that are both data efficient and computation efficient. We present systems that are trained using multiple tasks such that they “learn how to learn” to solve new tasks from only a few examples. These systems can efficiently solve new, unseen tasks drawn from a broad range of data distributions, in both the low and high data regimes, without the need for costly retraining. Adapting to a new task requires only a forward pass of the example task data through the trained network, making the learning of new tasks possible on mobile devices. In particular, we focus on few-shot image classification systems, i.e. machine learning systems that can distinguish between numerous classes of objects depicted in digital images given only a few examples of each class of object to learn from.

#### TaskNorm: rethinking batch normalization for meta-learning

John Bronskill, Jonathan Gordon, James Requeima, Sebastian Nowozin, Richard E. Turner, 2020. (In 37th International Conference on Machine Learning). Proceedings of Machine Learning Research.

Modern meta-learning approaches for image classification rely on increasingly deep networks to achieve state-of-the-art performance, making batch normalization an essential component of meta-learning pipelines. However, the hierarchical nature of the meta-learning setting presents several challenges that can render conventional batch normalization ineffective, giving rise to the need to rethink normalization in this setting. We evaluate a range of approaches to batch normalization for meta-learning scenarios, and develop a novel approach that we call TASKNORM. Experiments on fourteen datasets demonstrate that the choice of batch normalization has a dramatic effect on both classification accuracy and training time for both gradient based and gradient free meta-learning approaches. Importantly, TASKNORM is found to consistently improve performance. Finally, we provide a set of best practices for normalization that will allow fair comparison of meta-learning algorithms.

#### Scalable Exact Inference in Multi-Output Gaussian Processes

Wessel Bruinsma, Eric Perim, Will Tebbutt, J. Scott Hosking, Arno Solin, Richard E. Turner, 2020. (In 37th International Conference on Machine Learning). Proceedings of Machine Learning Research.

Multi-output Gaussian processes (MOGPs) leverage the flexibility and interpretability of GPs while capturing structure across outputs, which is desirable, for example, in spatio-temporal modelling. The key problem with MOGPs is their computational scaling O(n^3 p^3), which is cubic in the number of both inputs n (e.g., time points or locations) and outputs p. For this reason, a popular class of MOGPs assumes that the data live around a low-dimensional linear subspace, reducing the complexity to O(n^3 m^3). However, this cost is still cubic in the dimensionality of the subspace m, which is still prohibitively expensive for many applications. We propose the use of a sufficient statistic of the data to accelerate inference and learning in MOGPs with orthogonal bases. The method achieves linear scaling in m in practice, allowing these models to scale to large m without sacrificing significant expressivity or requiring approximation. This advance opens up a wide range of real-world tasks and can be combined with existing GP approximations in a plug-and-play way. We demonstrate the efficacy of the method on various synthetic and real-world data sets.

#### Convergence of Sparse Variational Inference in Gaussian Processes Regression

David R. Burt, Carl Edward Rasmussen, Mark van der Wilk, 2020. (Journal of Machine Learning Research).

Gaussian processes are distributions over functions that are versatile and mathematically convenient priors in Bayesian modelling. However, their use is often impeded for data with large numbers of observations, N, due to the cubic (in N) cost of matrix operations used in exact inference. Many solutions have been proposed that rely on M ≪ N inducing variables to form an approximation at a cost of O(NM^2). While the computational cost appears linear in N, the true complexity depends on how M must scale with N to ensure a certain quality of the approximation. In this work, we investigate upper and lower bounds on how M needs to grow with N to ensure high quality approximations. We show that we can make the KL-divergence between the approximate model and the exact posterior arbitrarily small for a Gaussian-noise regression model with M ≪ N. Specifically, for the popular squared exponential kernel and D-dimensional Gaussian distributed covariates, M = O((log N)^D) suffices and a method with an overall computational cost of O(N (log N)^(2D) (log log N)^2) can be used to perform inference.

#### Lazily Adapted Constant Kinky Inference for non-parametric regression and model-reference adaptive control

Jan-Peter Calliess, Stephen J. Roberts, Carl Edward Rasmussen, Jan Maciejowski, 2020. (Automatica). **DOI**: 10.1016/j.automatica.2020.109216.

Techniques known as Nonlinear Set Membership prediction or Lipschitz Interpolation are approaches to supervised machine learning that utilise presupposed Lipschitz properties to perform inference over unobserved function values. Provided a bound on the true best Lipschitz constant of the target function is known a priori, they offer convergence guarantees, as well as bounds around the predictions. Considering a more general setting that builds on Lipschitz continuity, we propose an online method for estimating the Lipschitz constant from function value observations that are possibly corrupted by bounded noise. Utilising this as a data-dependent hyper-parameter gives rise to a nonparametric machine learning method, for which we establish strong universal approximation guarantees. That is, we show that our prediction rule can learn any continuous function on compact support in the limit of increasingly dense data, up to a worst-case error that can be bounded by the level of observational error. We also consider applications of our nonparametric regression method to learning-based control. For a class of discrete-time settings, we establish convergence guarantees on the closed-loop tracking error of our online learning-based controllers. To provide evidence that our method can be beneficial not only in theory but also in practice, we apply it in the context of nonparametric model-reference adaptive control (MRAC). Across a range of simulated aircraft roll-dynamics and performance metrics our approach outperforms recently proposed alternatives that were based on Gaussian processes and RBF-neural networks.
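
Lipschitz interpolation with a data-driven constant, as described in the abstract above, can be sketched for scalar inputs as follows. This is a simplified illustration under stated assumptions rather than the paper's exact estimator: the function names are hypothetical, the Lipschitz constant is taken as the largest observed slope between data points, and the `noise_slack` parameter (subtracted from each observed difference) is an assumed stand-in for how bounded noise might be accounted for.

```python
def estimate_lipschitz(xs, ys, noise_slack=0.0):
    """Largest slope between any two observations, reduced by an assumed noise slack."""
    best = 0.0
    for i in range(len(xs)):
        for j in range(i + 1, len(xs)):
            gap = abs(xs[i] - xs[j])
            if gap > 0.0:
                slope = (abs(ys[i] - ys[j]) - noise_slack) / gap
                best = max(best, slope)
    return best


def lipschitz_predict(x, xs, ys, lipschitz):
    """Return (prediction, lower bound, upper bound) at the query point x.

    Every function with the given Lipschitz constant that passes through the
    data lies between the two envelopes; the prediction is their midpoint.
    """
    upper = min(y + lipschitz * abs(x - xi) for xi, y in zip(xs, ys))
    lower = max(y - lipschitz * abs(x - xi) for xi, y in zip(xs, ys))
    return 0.5 * (upper + lower), lower, upper


xs = [0.0, 1.0, 3.0]
ys = [0.0, 2.0, 1.0]
L = estimate_lipschitz(xs, ys)
print("estimated Lipschitz constant:", L)
print("prediction and bounds at x = 2:", lipschitz_predict(2.0, xs, ys, L))
```

With noise-free data and `noise_slack=0.0`, the upper and lower envelopes coincide at every training input, so the predictor interpolates the observations exactly.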

#### Stochastic Flows and Geometric Optimization on the Orthogonal Group

Krzysztof Choromanski, David Cheikhi, Jared Davis, Valerii Likhosherstov, Achille Nazaret, Achraf Bahamou, Xingyou Song, Mrugank Akarte, Jack Parker-Holder, Jacob Bergquist, Yuan Gao, Aldo Pacchiano, Tamas Sarlos, Adrian Weller, Vikas Sindhwani, 2020. (In 37th International Conference on Machine Learning).

We present a new class of stochastic, geometrically-driven optimization algorithms on the orthogonal group O(d) and naturally reductive homogeneous manifolds obtained from the action of the rotation group SO(d). We theoretically and experimentally demonstrate that our methods can be applied in various fields of machine learning including deep, convolutional and recurrent neural networks, reinforcement learning, normalizing flows and metric learning. We show an intriguing connection between efficient stochastic optimization on the orthogonal group and graph theory (e.g. matching problem, partition functions over graphs, graph-coloring). We leverage the theory of Lie groups and provide theoretical results for the designed class of algorithms. We demonstrate broad applicability of our methods by showing strong performance on the seemingly unrelated tasks of learning world models to obtain stable policies for the most difficult Humanoid agent from OpenAI Gym and improving convolutional neural networks.

#### You shouldn't trust me: Learning models which conceal unfairness from multiple explanation methods

Botty Dimanov, Umang Bhatt, Mateja Jamnik, Adrian Weller, 2020. (In European Conference on Artificial Intelligence (ECAI)).

Transparency of algorithmic systems has been discussed as a way for end-users and regulators to develop appropriate trust in machine learning models. One popular approach, LIME [26], even suggests that model explanations can answer the question “Why should I trust you?” Here we show a straightforward method for modifying a pre-trained model to manipulate the output of many popular feature importance explanation methods with little change in accuracy, thus demonstrating the danger of trusting such explanation methods. We show how this explanation attack can mask a model’s discriminatory use of a sensitive feature, raising strong concerns about using such explanation methods to check model fairness.

#### Sparse Gaussian Processes with Spherical Harmonic Features

Vincent Dutordoir, Nicolas Durrande, James Hensman, June 2020. (In 37th International Conference on Machine Learning). Online.

We introduce a new class of inter-domain variational Gaussian processes (GP) where data is mapped onto the unit hypersphere in order to use spherical harmonic representations. Our inference scheme is comparable to variational Fourier features, but it does not suffer from the curse of dimensionality, and leads to diagonal covariance matrices between inducing variables. This enables a speed-up in inference, because it bypasses the need to invert large covariance matrices. Our experiments show that our model is able to fit a regression model for a dataset with 6 million entries two orders of magnitude faster than standard sparse GPs, while retaining state-of-the-art accuracy. We also demonstrate competitive performance on classification with non-conjugate likelihoods.

#### Meta-Learning Stationary Stochastic Process Prediction With Convolutional Neural Processes

Andrew Y. K. Foong, Wessel P. Bruinsma, Jonathan Gordon, Yann Dubois, James Requeima, Richard E. Turner, 2020. (In Advances in Neural Information Processing Systems 33). Curran Associates, Inc.

Stationary stochastic processes (SPs) are a key component of many probabilistic models, such as those for off-the-grid spatio-temporal data. They enable the statistical symmetry of underlying physical phenomena to be leveraged, thereby aiding generalization. Prediction in such models can be viewed as a translation equivariant map from observed data sets to predictive SPs, emphasizing the intimate relationship between stationarity and equivariance. Building on this, we propose the Convolutional Neural Process (ConvNP), which endows Neural Processes (NPs) with translation equivariance and extends convolutional conditional NPs to allow for dependencies in the predictive distribution. The latter enables ConvNPs to be deployed in settings which require coherent samples, such as Thompson sampling or conditional image completion. Moreover, we propose a new maximum-likelihood objective to replace the standard ELBO objective in NPs, which conceptually simplifies the framework and empirically improves performance. We demonstrate the strong performance and generalization capabilities of ConvNPs on 1D regression, image completion, and various tasks with real-world spatio-temporal data.

#### On the Expressiveness of Approximate Inference in Bayesian Neural Networks

Andrew Foong, David Burt, Yingzhen Li, Richard Turner, 2020. (In Advances in Neural Information Processing Systems 33).

#### Convolutional Conditional Neural Processes

Jonathan Gordon, Wessel Bruinsma, Andrew Y. K. Foong, James Requeima, Yann Dubois, Richard Turner, April 2020. (In 8th International Conference on Learning Representations). Addis Ababa.

We introduce the Convolutional Conditional Neural Process (ConvCNP), a new member of the Neural Process family that models translation equivariance in the data. Translation equivariance is an important inductive bias for many learning problems including time series modelling, spatial data, and images. The model embeds data sets into an infinite-dimensional function space, as opposed to finite-dimensional vector spaces. To formalize this notion, we extend the theory of neural representations of sets to include functional representations, and demonstrate that any translation-equivariant embedding can be represented using a convolutional deep-set. We evaluate ConvCNPs in several settings, demonstrating that they achieve state-of-the-art performance compared to existing NPs. We demonstrate that building in translation equivariance enables zero-shot generalization to challenging, out-of-domain tasks.

#### Exact posteriors of wide Bayesian neural networks

Jiri Hron, Yasaman Bahri, Roman Novak, Jeffrey Pennington, Jascha Sohl-Dickstein, 2020. (UDL (ICML workshop)).

Recent work has shown that the prior over functions induced by a deep Bayesian neural network (BNN) behaves as a Gaussian process (GP) as the width of all layers becomes large. However, many BNN applications are concerned with the BNN function space posterior. While some empirical evidence of the posterior convergence was provided in the original works of Neal (1996) and Matthews et al. (2018), it is limited to small datasets or architectures due to the notorious difficulty of obtaining and verifying exactness of BNN posterior approximations. We provide the missing theoretical proof that the exact BNN posterior converges (weakly) to the one induced by the GP limit of the prior. For empirical validation, we show how to generate exact samples from a finite BNN on a small dataset via rejection sampling.
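
As a rough illustration of the rejection-sampling idea mentioned in this abstract (not the authors' code), the sketch below draws exact posterior samples for a hypothetical one-parameter linear model rather than a BNN. Proposals come from the prior and are accepted with probability equal to the likelihood divided by its upper bound. The toy model, the single data point, and all function names are assumptions made for illustration.

```python
import math
import random


def rejection_sample_posterior(x, y, n_proposals, seed=0):
    """Draw exact posterior samples of w in y = w * x + noise.

    Assumed toy setup: prior w ~ N(0, 1) and noise ~ N(0, 1).
    """
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_proposals):
        w = rng.gauss(0.0, 1.0)  # propose from the prior
        # Unnormalised Gaussian likelihood; its maximum over w is 1,
        # so it can serve directly as the acceptance probability.
        accept_prob = math.exp(-0.5 * (y - w * x) ** 2)
        if rng.random() < accept_prob:
            accepted.append(w)
    return accepted


samples = rejection_sample_posterior(x=1.0, y=1.0, n_proposals=20000)
posterior_mean = sum(samples) / len(samples)
# Conjugate-Gaussian result for this toy model: mean = x * y / (1 + x**2) = 0.5.
print(f"accepted {len(samples)} samples, empirical mean {posterior_mean:.3f}")
```

Accepted draws are exact posterior samples, but the acceptance rate shrinks quickly as data are added, which is consistent with the abstract restricting this check to a small dataset.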

#### Infinite attention: NNGP and NTK for deep attention networks

Jiri Hron, Yasaman Bahri, Jascha Sohl-Dickstein, Roman Novak, 2020. (ICML).

There is a growing amount of literature on the relationship between wide neural networks (NNs) and Gaussian processes (GPs), identifying an equivalence between the two for a variety of NN architectures. This equivalence enables, for instance, accurate approximation of the behaviour of wide Bayesian NNs without MCMC or variational approximations, or characterisation of the distribution of randomly initialised wide NNs optimised by gradient descent without ever running an optimiser. We provide a rigorous extension of these results to NNs involving attention layers, showing that unlike single-head attention, which induces non-Gaussian behaviour, multi-head attention architectures behave as GPs as the number of heads tends to infinity. We further discuss the effects of positional encodings and layer normalisation, and propose modifications of the attention mechanism which lead to improved results for both finite and infinitely wide NNs. We evaluate attention kernels empirically, leading to a moderate improvement upon the previous state-of-the-art on CIFAR-10 for GPs without trainable kernels and advanced data preprocessing. Finally, we introduce new features to the Neural Tangents library (Novak et al., 2020) allowing applications of NNGP/NTK models, with and without attention, to variable-length sequences, with an example on the IMDb reviews dataset.

#### Exploration in two-stage recommender systems

Jiri Hron, Karl Krauth, Michael I. Jordan, Niki Kilbertus, 2020. (REVEAL (ACM RecSys workshop)).

Two-stage recommender systems are widely adopted in industry due to their scalability and maintainability. These systems produce recommendations in two steps: (i) multiple nominators preselect a small number of items from a large pool using cheap-to-compute item embeddings; (ii) with a richer set of features, a ranker rearranges the nominated items and serves them to the user. A key challenge of this setup is that optimal performance of each stage in isolation does not imply optimal global performance. In response to this issue, Ma et al. (2020) proposed a nominator training objective importance weighted by the ranker’s probability of recommending each item. In this work, we focus on the complementary issue of exploration. Modeled as a contextual bandit problem, we find LinUCB (a near optimal exploration strategy for single-stage systems) may lead to linear regret when deployed in two-stage recommenders. We therefore propose a method of synchronising the exploration strategies between the ranker and the nominators. Our algorithm only relies on quantities already computed by standard LinUCB at each stage and can be implemented in three lines of additional code. We end by demonstrating the effectiveness of our algorithm experimentally.

#### Bandit optimisation of functions in the Matérn kernel RKHS

David Janz, David Burt, Javier Gonzalez, 2020. (In 23rd International Conference on Artificial Intelligence and Statistics).

We consider the problem of optimising functions in the reproducing kernel Hilbert space (RKHS) of a Matérn kernel with smoothness parameter ν over the domain [0,1]^d under noisy bandit feedback. Our contribution, the π-GP-UCB algorithm, is the first practical approach with guaranteed sublinear regret for all ν > 1 and d ≥ 1. Empirical validation suggests better performance and drastically improved computational scalability compared with its predecessor, Improved GP-UCB.
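
For readers unfamiliar with the kernel family named in this abstract, the following is a minimal sketch of the standard closed-form Matérn kernels for half-integer smoothness. It is not taken from the paper; the function name and the `lengthscale` parameter are illustrative assumptions. Of the three cases, smoothness 3/2 and 5/2 exceed 1, while 1/2 does not.

```python
import math


def matern_kernel(r, lengthscale=1.0, nu=1.5):
    """Matérn kernel value at distance r for nu in {0.5, 1.5, 2.5}."""
    if r < 0:
        raise ValueError("distance r must be non-negative")
    s = r / lengthscale
    if nu == 0.5:
        return math.exp(-s)
    if nu == 1.5:
        a = math.sqrt(3.0) * s
        return (1.0 + a) * math.exp(-a)
    if nu == 2.5:
        a = math.sqrt(5.0) * s
        return (1.0 + a + a * a / 3.0) * math.exp(-a)
    raise ValueError("only nu in {0.5, 1.5, 2.5} is implemented in this sketch")


for nu in (0.5, 1.5, 2.5):
    values = [matern_kernel(r, nu=nu) for r in (0.0, 0.5, 1.0, 2.0)]
    print(nu, [round(v, 4) for v in values])
```

Larger values of the smoothness parameter give smoother sample functions; as it grows the kernel approaches the squared exponential kernel.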

#### Algorithmic recourse under imperfect causal knowledge: a probabilistic approach

A.-H. Karimi*, J. von Kügelgen*, B. Schölkopf, I. Valera, 2020. (In Advances in Neural Information Processing Systems 33). Edited by H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, H. Lin. Curran Associates, Inc. **Note**: *equal contribution.

Recent work has discussed the limitations of counterfactual explanations to recommend actions for algorithmic recourse, and argued for the need of taking causal relationships between features into consideration. Unfortunately, in practice, the true underlying structural causal model is generally unknown. In this work, we first show that it is impossible to guarantee recourse without access to the true structural equations. To address this limitation, we propose two probabilistic approaches to select optimal actions that achieve recourse with high probability given limited causal knowledge (e.g., only the causal graph). The first captures uncertainty over structural equations under additive Gaussian noise, and uses Bayesian model averaging to estimate the counterfactual distribution. The second removes any assumptions on the structural equations by instead computing the average effect of recourse actions on individuals similar to the person who seeks recourse, leading to a novel subpopulation-based interventional notion of recourse. We then derive a gradient-based procedure for selecting optimal recourse actions, and empirically show that the proposed approaches lead to more reliable recommendations under imperfect causal knowledge than non-probabilistic baselines.

#### Fair Decisions Despite Imperfect Predictions

Niki Kilbertus, Manuel Gomez Rodriguez, Bernhard Schölkopf, Krikamol Muandet, Isabel Valera, 26–28 Aug 2020. (In 23rd International Conference on Artificial Intelligence and Statistics). Edited by Silvia Chiappa, Roberto Calandra. PMLR. Proceedings of Machine Learning Research.

Consequential decisions are increasingly informed by sophisticated data-driven predictive models. However, consistently learning accurate predictive models requires access to ground truth labels. Unfortunately, in practice, labels may only exist conditional on certain decisions—if a loan is denied, there is not even an option for the individual to pay back the loan. In this paper, we show that, in this selective labels setting, learning to predict is suboptimal in terms of both fairness and utility. To avoid this undesirable behavior, we propose to directly learn stochastic decision policies that maximize utility under fairness constraints. In the context of fair machine learning, our results suggest the need for a paradigm shift from “learning to predict” to “learning to decide”. Experiments on synthetic and real-world data illustrate the favorable properties of learning to decide, in terms of both utility and fairness.

#### Semi-supervised learning, causality, and the conditional cluster assumption

J. von Kügelgen, A. Mey, M. Loog, B. Schölkopf, 2020. (In Proceedings of the 36th International Conference on Uncertainty in Artificial Intelligence (UAI)). Edited by Jonas Peters, David Sontag. PMLR. Proceedings of Machine Learning Research. **Note**: also at NeurIPS 2019 Workshop "Do the right thing: machine learning and causal inference for improved decision making".

While the success of semi-supervised learning (SSL) is still not fully understood, Schölkopf et al. (2012) have established a link to the principle of independent causal mechanisms. They conclude that SSL should be impossible when predicting a target variable from its causes, but possible when predicting it from its effects. Since both these cases are restrictive, we extend their work by considering classification using cause and effect features at the same time, such as predicting a disease from both risk factors and symptoms. While standard SSL exploits information contained in the marginal distribution of all inputs (to improve the estimate of the conditional distribution of the target given inputs), we argue that in our more general setting we should use information in the conditional distribution of effect features given causal features. We explore how this insight generalises the previous understanding, and how it relates to and can be exploited algorithmically for SSL.

#### Towards causal generative scene models via competition of experts

J. von Kügelgen*, I. Ustyuzhaninov*, P. Gehler, M. Bethge, B. Schölkopf, 2020. (In ICLR 2020 Workshop "Causal Learning for Decision Making"). **Note**: *equal contribution.

Learning how to model complex scenes in a modular way with recombinable components is a pre-requisite for higher-order reasoning and acting in the physical world. However, current generative models lack the ability to capture the inherently compositional and layered nature of visual scenes. While recent work has made progress towards unsupervised learning of object-based scene representations, most models still maintain a global representation space (i.e., objects are not explicitly separated), and cannot generate scenes with novel object arrangement and depth ordering. Here, we present an alternative approach which uses an inductive bias encouraging modularity by training an ensemble of generative models (experts). During training, experts compete for explaining parts of a scene, and thus specialise on different object classes, with objects being identified as parts that re-occur across multiple scenes. Our model allows for controllable sampling of individual objects and recombination of experts in physically plausible ways. In contrast to other methods, depth layering and occlusion are handled correctly, moving this approach closer to a causal generative scene model. Experiments on simple toy data qualitatively demonstrate the conceptual advantages of the proposed approach.

#### Approximate inference for Fully Bayesian Gaussian process Regression

Vidhi Lalchand, Carl Edward Rasmussen, 2020. (In 2nd Symposium on Advances in Approximate Bayesian Inference).

Learning in Gaussian Process models occurs through the adaptation of hyperparameters of the mean and the covariance function. The classical approach entails maximizing the marginal likelihood yielding fixed point estimates (an approach called Type II maximum likelihood or ML-II). An alternative learning procedure is to infer the posterior over hyperparameters in a hierarchical specification of GPs we call Fully Bayesian Gaussian Process Regression (GPR). This work considers two approximation schemes for the intractable hyperparameter posterior: 1) Hamiltonian Monte Carlo (HMC) yielding a sampling based approximation and 2) Variational Inference (VI) where the posterior over hyperparameters is approximated by a factorized Gaussian (mean-field) or a full-rank Gaussian accounting for correlations between hyperparameters. We analyse the predictive performance for fully Bayesian GPR on a range of benchmark data sets.

#### Neural Tangents: fast and easy infinite networks in Python

Roman Novak, Lechao Xiao, Jiri Hron, Jaehoon Lee, Jascha Sohl-Dickstein, Samuel Schoenholz, 2020. (ICLR).

Neural Tangents is a library designed to enable research into infinite-width neural networks. It provides a high-level API for specifying complex and hierarchical neural network architectures. These networks can then be trained and evaluated either at finite-width as usual or in their infinite-width limit. Infinite-width networks can be trained analytically using exact Bayesian inference or using gradient descent via the Neural Tangent Kernel. Additionally, Neural Tangents provides tools to study gradient descent training dynamics of wide but finite networks in either function space or weight space. The entire library runs out-of-the-box on CPU, GPU, or TPU. All computations can be automatically distributed over multiple accelerators with near-linear scaling in the number of devices.

#### Einsum Networks: Fast and Scalable Learning of Tractable Probabilistic Circuits

Robert Peharz, Steven Lang, Antonio Vergari, Karl Stelzner, Alejandro Molina, Martin Trapp, Guy Van den Broeck, Kristian Kersting, Zoubin Ghahramani, July 2020. (In 37th International Conference on Machine Learning). Online.

Probabilistic circuits (PCs) are a promising avenue for probabilistic modeling, as they permit a wide range of exact and efficient inference routines. Recent “deep-learning-style” implementations of PCs strive for better scalability, but are still difficult to train on real-world data, due to their sparsely connected computational graphs. In this paper, we propose Einsum Networks (EiNets), a novel implementation design for PCs, improving prior art in several regards. At their core, EiNets combine a large number of arithmetic operations in a single monolithic einsum-operation, leading to speedups and memory savings of up to two orders of magnitude, in comparison to previous implementations. As an algorithmic contribution, we show that the implementation of Expectation-Maximization (EM) can be simplified for PCs, by leveraging automatic differentiation. Furthermore, we demonstrate that EiNets scale well to datasets which were previously out of reach, such as SVHN and CelebA, and that they can be used as faithful generative image models.

#### Deep Structured Mixtures of Gaussian Processes

Martin Trapp, Robert Peharz, Franz Pernkopf, Carl Edward Rasmussen, August 2020. (In 23rd International Conference on Artificial Intelligence and Statistics). Online.

Gaussian Processes (GPs) are powerful non-parametric Bayesian regression models that allow exact posterior inference, but exhibit high computational and memory costs. In order to improve scalability of GPs, approximate posterior inference is frequently employed, where a prominent class of approximation techniques is based on local GP experts. However, local-expert techniques proposed so far are either not well-principled, come with limited approximation guarantees, or lead to intractable models. In this paper, we introduce deep structured mixtures of GP experts, a stochastic process model which i) allows exact posterior inference, ii) has attractive computational and memory costs, and iii) when used as GP approximation, captures predictive uncertainties consistently better than previous expert-based approximations. In a variety of experiments, we show that deep structured mixtures have a low approximation error and often perform competitively with or outperform prior work.

#### To Ensemble or Not Ensemble: When Does End-to-End Training Fail?

Andrew Webb, Charles Reynolds, Wenlin Chen, Henry Reeve, Dan Iliescu, Mikel Luján, Gavin Brown, 2020. (In European Conference on Machine Learning (ECML)).

End-to-End training (E2E) is becoming more and more popular to train complex Deep Network architectures. An interesting question is whether this trend will continue—are there any clear failure cases for E2E training? We study this question in depth, for the specific case of E2E training an ensemble of networks. Our strategy is to blend the gradient smoothly in between two extremes: from independent training of the networks, up to full E2E training. We find clear failure cases, where overparameterized models cannot be trained E2E. A surprising result is that the optimum can sometimes lie in between the two, neither an ensemble nor an E2E system. The work also uncovers links to Dropout, and raises questions around the nature of ensemble diversity and multi-branch networks.

**Comment:** arXiv
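
The blending between independent and end-to-end training described in the abstract above can be pictured with a simple interpolated loss. The sketch below is only one possible way to write such an interpolation, assuming a squared-error loss, an ensemble that averages its members' predictions, and a hypothetical `blend` coefficient in [0, 1]; the paper's exact formulation may differ.

```python
def blended_loss(member_predictions, target, blend):
    """Interpolate between independent (blend=0) and end-to-end (blend=1) losses."""
    if not 0.0 <= blend <= 1.0:
        raise ValueError("blend must lie in [0, 1]")
    n = len(member_predictions)
    # End-to-end extreme: loss of the averaged ensemble prediction.
    ensemble_prediction = sum(member_predictions) / n
    ensemble_loss = (ensemble_prediction - target) ** 2
    # Independent extreme: average of each member's own loss.
    independent_loss = sum((p - target) ** 2 for p in member_predictions) / n
    return blend * ensemble_loss + (1.0 - blend) * independent_loss


predictions = [0.0, 2.0]  # two members that err in opposite directions
for blend in (0.0, 0.5, 1.0):
    print(blend, blended_loss(predictions, target=1.0, blend=blend))
```

In this toy case the members' errors cancel in the average, so the end-to-end extreme sees no loss while the independent extreme still penalises each member.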

#### Inferring the effectiveness of government interventions against COVID-19

Jan M Brauner, Sören Mindermann, Mrinank Sharma, David Johnston, John Salvatier, Tomáš Gavenčiak, Anna B Stephenson, Gavin Leech, George Altman, Vladimir Mikulik, Alexander John Norman, Joshua Teperowski Monrad, Tamay Besiroglu, Hong Ge, Meghan A Hartwick, Yee Whye Teh, Leonid Chindelevitch, Yarin Gal, Jan Kulveit, December 2020. (Science).

## 2019

#### One-network Adversarial Fairness

Tameem Adel, Isabel Valera, Zoubin Ghahramani, Adrian Weller, January 2019. (In 33rd AAAI Conference on Artificial Intelligence). Hawaii.

There is currently a great expansion of the impact of machine learning algorithms on our lives, prompting the need for objectives other than pure performance, including fairness. Fairness here means that the outcome of an automated decision-making system should not discriminate between subgroups characterized by sensitive attributes such as gender or race. Given any existing differentiable classifier, we make only slight adjustments to the architecture including adding a new hidden layer, in order to enable the concurrent adversarial optimization for fairness and accuracy. Our framework provides one way to quantify the tradeoff between fairness and accuracy, while also leading to strong empirical performance.

#### TibGM: A Transferable and Information-Based Graphical Model Approach for Reinforcement Learning

Tameem Adel, Adrian Weller, June 2019. (In 36th International Conference on Machine Learning). Long Beach.

One of the challenges to reinforcement learning (RL) is scalable transferability among complex tasks. Incorporating a graphical model (GM), along with the rich family of related methods, as a basis for RL frameworks provides potential to address issues such as transferability, generalisation and exploration. Here we propose a flexible GM-based RL framework which leverages efficient inference procedures to enhance generalisation and transfer power. In our proposed transferable and information-based graphical model framework ‘TibGM’, we show the equivalence between our mutual information-based objective in the GM, and an RL consolidated objective consisting of a standard reward maximisation target and a generalisation/transfer objective. In settings where there is a sparse or deceptive reward signal, our TibGM framework is flexible enough to incorporate exploration bonuses depicting intrinsic rewards. We empirically verify improved performance and exploration power.

#### Fast training of sparse graph neural networks on dense hardware

Matej Balog, Bart van Merriënboer, Subhodeep Moitra, Yujia Li, Daniel Tarlow, 2019. (arXiv).

Graph neural networks have become increasingly popular in recent years due to their ability to naturally encode relational input data and their ability to scale to large graphs by operating on a sparse representation of graph adjacency matrices. As we look to scale up these models using custom hardware, a natural assumption would be that we need hardware tailored to sparse operations and/or dynamic control flow. In this work, we question this assumption by scaling up sparse graph neural networks using a platform targeted at dense computation on fixed-size data. Drawing inspiration from optimization of numerical algorithms on sparse matrices, we develop techniques that enable training the sparse graph neural network model from Allamanis et al. 2018 in 13 minutes using a 512-core TPUv2 Pod, whereas the original training takes almost a day.

#### Rates of Convergence for Sparse Variational Gaussian Process Regression

David R Burt, Carl Edward Rasmussen, Mark van der Wilk, 2019. (arXiv).

Excellent variational approximations to Gaussian process posteriors have been developed which avoid the O(N^3) scaling with dataset size N. They reduce the computational cost to O(NM^2), with M ≪ N being the number of inducing variables, which summarise the process. While the computational cost seems to be linear in N, the true complexity of the algorithm depends on how M must increase to ensure a certain quality of approximation. We address this by characterising the behavior of an upper bound on the KL divergence to the posterior. We show that with high probability the KL divergence can be made arbitrarily small by growing M more slowly than N. A particular case of interest is that for regression with normally distributed inputs in D-dimensions with the popular Squared Exponential kernel, M = O(log^D N) is sufficient. Our results show that as datasets grow, Gaussian process posteriors can truly be approximated cheaply, and provide a concrete rule for how to increase M in continual learning scenarios.
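
To give a feel for the scaling quoted in this abstract, the arithmetic below compares the cubic cost of exact inference with the sparse cost when the number of inducing variables grows polylogarithmically in N. The constant hidden by the O(·) notation is unknown here and is set to 1 purely as an assumption, so the numbers indicate orders of magnitude only.

```python
import math


def exact_gp_cost(n):
    """Operation count proportional to N cubed."""
    return n ** 3


def sparse_gp_cost(n, m):
    """Operation count proportional to N times M squared."""
    return n * m ** 2


def inducing_points_needed(n, d, constant=1.0):
    """M = constant * (log N)^D, rounded up; `constant` is an assumed placeholder."""
    return math.ceil(constant * math.log(n) ** d)


n, d = 10 ** 6, 2
m = inducing_points_needed(n, d)
print(f"N = {n}, D = {d}, M = {m}")
print(f"exact cost  ~ {exact_gp_cost(n):.3e}")
print(f"sparse cost ~ {sparse_gp_cost(n, m):.3e}")
```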

#### Motivations and Risks of Machine Ethics

Stephen Cave, Rune Nyrup, Karina Vold, Adrian Weller, 2019. (Proceedings of the IEEE).

This paper surveys reasons for and against pursuing the field of machine ethics, understood as research aiming to build “ethical machines.” We clarify the nature of this goal, why it is worth pursuing, and the risks involved in its pursuit. First, we survey and clarify some of the philosophical issues surrounding the concept of an “ethical machine” and the aims of machine ethics. Second, we argue that while there are good prima facie reasons for pursuing machine ethics, including the potential to improve the ethical alignment of both humans and machines, there are also potential risks that must be considered. Third, we survey these potential risks and point to where research should be devoted to clarifying and managing potential risks. We conclude by making some recommendations about the questions that future work could address.

#### Unifying Orthogonal Monte Carlo Methods

Krzysztof Choromanski, Mark Rowland, Wenyu Chen, Adrian Weller, June 2019. (In 36th International Conference on Machine Learning). Long Beach.

Many machine learning methods making use of Monte Carlo sampling in vector spaces have been shown to be improved by conditioning samples to be mutually orthogonal. Exact orthogonal coupling of samples is computationally intensive, hence approximate methods have been of great interest. In this paper, we present a unifying perspective of many approximate methods by considering Givens transformations, propose new approximate methods based on this framework, and demonstrate the first statistical guarantees for families of approximate methods in kernel approximation. We provide extensive empirical evaluations with guidance for practitioners.
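
The Givens transformations mentioned in this abstract are rotations acting on a single pair of coordinates, and any product of them is an orthogonal map. The sketch below is an illustration of that property only, not the paper's estimators: it applies one randomly drawn chain of rotations to two orthogonal unit vectors and prints their norms and inner product afterwards. The chain length, dimension, and helper names are arbitrary assumptions.

```python
import math
import random


def apply_givens(vector, i, j, angle):
    """Rotate coordinates i and j of `vector` by `angle`; returns a new list."""
    c, s = math.cos(angle), math.sin(angle)
    rotated = list(vector)
    rotated[i] = c * vector[i] - s * vector[j]
    rotated[j] = s * vector[i] + c * vector[j]
    return rotated


def rotate_with_chain(vector, chain):
    """Apply a sequence of (i, j, angle) Givens rotations in order."""
    for i, j, angle in chain:
        vector = apply_givens(vector, i, j, angle)
    return vector


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


rng = random.Random(0)
dim = 4
chain = []
for _ in range(6):
    i, j = rng.sample(range(dim), 2)  # two distinct coordinate indices
    chain.append((i, j, rng.uniform(0.0, 2.0 * math.pi)))

u = rotate_with_chain([1.0, 0.0, 0.0, 0.0], chain)
v = rotate_with_chain([0.0, 1.0, 0.0, 0.0], chain)
print(dot(u, u), dot(v, v), dot(u, v))
```

Each rotation costs a constant number of operations per vector, so a short chain is much cheaper than constructing a dense orthogonal matrix.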

#### Deep Convolutional Networks as shallow Gaussian Processes

Adrià Garriga-Alonso, Carl Edward Rasmussen, Laurence Aitchison, 2019. (In International Conference on Learning Representations (ICLR)).

We show that the output of a (residual) convolutional neural network (CNN) with an appropriate prior over the weights and biases is a Gaussian process (GP) in the limit of infinitely many convolutional filters, extending similar results for dense networks. For a CNN, the equivalent kernel can be computed exactly and, unlike “deep kernels”, has very few parameters: only the hyperparameters of the original CNN. Further, we show that this kernel has two properties that allow it to be computed efficiently; the cost of evaluating the kernel for a pair of images is similar to a single forward pass through the original CNN with only one filter per layer. The kernel equivalent to a 32-layer ResNet obtains 0.84% classification error on MNIST, a new record for GPs with a comparable number of parameters.

#### Convolutional neural networks: A magic bullet for gravitational-wave detection?

Timothy Gebhard, Niki Kilbertus, Ian Harry, Bernhard Schölkopf, September 2019. (Physical Review D). American Physical Society. **DOI**: https://doi.org/10.1103/PhysRevD.100.063015.

In the last few years, machine learning techniques, in particular convolutional neural networks, have been investigated as a method to replace or complement traditional matched filtering techniques that are used to detect the gravitational-wave signature of merging black holes. However, to date, these methods have not yet been successfully applied to the analysis of long stretches of data recorded by the Advanced LIGO and Virgo gravitational-wave observatories. In this work, we critically examine the use of convolutional neural networks as a tool to search for merging black holes. We identify the strengths and limitations of this approach, highlight some common pitfalls in translating between machine learning and gravitational-wave astronomy, and discuss the interdisciplinary challenges. In particular, we explain in detail why convolutional neural networks alone cannot be used to claim a statistically significant gravitational-wave detection. However, we demonstrate how they can still be used to rapidly flag the times of potential signals in the data for a more detailed follow-up. Our convolutional neural network architecture as well as the proposed performance metrics are better suited for this task than a standard binary classification scheme. A detailed evaluation of our approach on Advanced LIGO data demonstrates the potential of such systems as trigger generators. Finally, we sound a note of caution by constructing adversarial examples, which showcase interesting “failure modes” of our model, where inputs with no visible resemblance to real gravitational-wave signals are identified as such by the network with high confidence.

#### Meta-learning probabilistic inference for prediction

Jonathan Gordon, John Bronskill, Matthias Bauer, Sebastian Nowozin, Richard Turner, April 2019. (In 7th International Conference on Learning Representations). New Orleans.

This paper introduces a new framework for data efficient and versatile learning. Specifically: 1) We develop ML-PIP, a general framework for Meta-Learning approximate Probabilistic Inference for Prediction. ML-PIP extends existing probabilistic interpretations of meta-learning to cover a broad class of methods. 2) We introduce Versa, an instance of the framework employing a flexible and versatile amortization network that takes few-shot learning datasets as inputs, with arbitrary numbers of shots, and outputs a distribution over task-specific parameters in a single forward pass. Versa substitutes optimization at test time with forward passes through inference networks, amortizing the cost of inference and relieving the need for second derivatives during training. 3) We evaluate Versa on benchmark datasets where the method sets new state-of-the-art results, and can handle arbitrary numbers of shots, and for classification, arbitrary numbers of classes at train and test time. The power of the approach is then demonstrated through a challenging few-shot ShapeNet view reconstruction task.

#### Overcoming Mean-Field Approximations in Recurrent Gaussian Process Models

Alessandro Davide Ialongo, Mark van der Wilk, James Hensman, Carl Edward Rasmussen, June 2019. (In 36th International Conference on Machine Learning). Long Beach.

We identify a new variational inference scheme for dynamical systems whose transition function is modelled by a Gaussian process. Inference in this setting has either employed computationally intensive MCMC methods, or relied on factorisations of the variational posterior. As we demonstrate in our experiments, the factorisation between latent system states and transition function can lead to a miscalibrated posterior and to learning unnecessarily large noise terms. We eliminate this factorisation by explicitly modelling the dependence between state trajectories and the Gaussian process posterior. Samples of the latent states can then be tractably generated by conditioning on this representation. The method we obtain (VCDT: variationally coupled dynamics and trajectories) gives better predictive performance and more calibrated estimates of the transition function, yet maintains the same time and space complexities as mean-field methods. Code is available at: https://github.com/ialong/GPt.

#### Successor Uncertainties: exploration and uncertainty in temporal difference learning

David Janz, Jiri Hron, Przemyslaw Mazur, José Miguel Hernández-Lobato, Katja Hofmann, Sebastian Tschiatschek, 2019. (NeurIPS).

Posterior sampling for reinforcement learning (PSRL) is an effective method for balancing exploration and exploitation in reinforcement learning. Randomised value functions (RVF) can be viewed as a promising approach to scaling PSRL. However, we show that most contemporary algorithms combining RVF with neural network function approximation do not possess the properties which make PSRL effective, and provably fail in sparse reward problems. Moreover, we find that propagation of uncertainty, a property of PSRL previously thought important for exploration, does not preclude this failure. We use these insights to design Successor Uncertainties (SU), a cheap and easy to implement RVF algorithm that retains key properties of PSRL. SU is highly effective on hard tabular exploration benchmarks. Furthermore, on the Atari 2600 domain, it surpasses human performance on 38 of 49 games tested (achieving a median human normalised score of 2.09), and outperforms its closest RVF competitor, Bootstrapped DQN, on 36 of those.

#### The sensitivity of counterfactual fairness to unmeasured confounding

Niki Kilbertus, Phil Ball, Matt Kusner, Adrian Weller, Ricardo Silva, July 2019. (In 35th Conference on Uncertainty in Artificial Intelligence). Tel Aviv.

Causal approaches to fairness have seen substantial recent interest, both from the machine learning community and from wider parties interested in ethical prediction algorithms. In no small part, this has been due to the fact that causal models allow one to simultaneously leverage data and expert knowledge to remove discriminatory effects from predictions. However, one of the primary assumptions in causal modeling is that you know the causal graph. This introduces a new opportunity for bias, caused by misspecifying the causal model. One common way for misspecification to occur is via unmeasured confounding: the true causal effect between variables is partially described by unobserved quantities. In this work we design tools to assess the sensitivity of fairness measures to this confounding for the popular class of non-linear additive noise models (ANMs). Specifically, we give a procedure for computing the maximum difference between two counterfactually fair predictors, where one has become biased due to confounding. For the case of bivariate confounding our technique can be swiftly computed via a sequence of closed-form updates. For multivariate confounding we give an algorithm that can be efficiently solved via automatic differentiation. We demonstrate our new sensitivity analysis tools in real-world fairness scenarios to assess the bias arising from confounding.

#### Optimal experimental design via Bayesian optimization: active causal structure learning for Gaussian process networks

J. von Kügelgen, P. K. Rubenstein, B. Schölkopf, A. Weller, December 2019. (In NeurIPS 2019 Workshop Do the right thing: machine learning and causal inference for improved decision making).

We study the problem of causal discovery through targeted interventions. Starting from few observational measurements, we follow a Bayesian active learning approach to perform those experiments which, in expectation with respect to the current model, are maximally informative about the underlying causal structure. Unlike previous work, we consider the setting of continuous random variables with non-linear functional relationships, modelled with Gaussian process priors. To address the arising problem of choosing from an uncountable set of possible interventions, we propose to use Bayesian optimisation to efficiently maximise a Monte Carlo estimate of the expected information gain.

#### Semi-Generative Modelling: Covariate-Shift Adaptation with Cause and Effect Features

J. von Kügelgen, A. Mey, M. Loog, 2019. (In 22nd International Conference on Artificial Intelligence and Statistics). Edited by Kamalika Chaudhuri, Masashi Sugiyama. PMLR.

Current methods for covariate-shift adaptation use unlabelled data to compute importance weights or domain-invariant features, while the final model is trained on labelled data only. Here, we consider a particular case of covariate shift which allows us also to learn from unlabelled data, that is, combining adaptation with semi-supervised learning. Using ideas from causality, we argue that this requires learning with both causes, X_C, and effects, X_E, of a target variable, Y, and show how this setting leads to what we call a semi-generative model, P(Y,X_E|X_C,θ). Our approach is robust to domain shifts in the distribution of causal features and leverages unlabelled data by learning a direct map from causes to effects. Experiments on synthetic data demonstrate significant improvements in classification over purely-supervised and importance-weighting baselines.

#### Train and Test Tightness of LP Relaxations in Structured Prediction

Ofer Meshi, Ben London, Adrian Weller, David Sontag, 2019. (Journal of Machine Learning Research).

Structured prediction is used in areas including computer vision and natural language processing to predict structured outputs such as segmentations or parse trees. In these settings, prediction is performed by MAP inference or, equivalently, by solving an integer linear program. Because of the complex scoring functions required to obtain accurate predictions, both learning and inference typically require the use of approximate solvers. We propose a theoretical explanation for the striking observation that approximations based on linear programming (LP) relaxations are often tight (exact) on real-world instances. In particular, we show that learning with LP relaxed inference encourages integrality of training instances, and that this training tightness generalizes to test data.

#### Dropout as a Structured Shrinkage Prior

Eric Nalisnick, José Miguel Hernández-Lobato, Padhraic Smyth, June 2019. (In 36th International Conference on Machine Learning). Long Beach.

Dropout regularization of deep neural networks has been a mysterious yet effective tool to prevent overfitting. Explanations for its success range from the prevention of co-adapted weights to it being a form of cheap Bayesian inference. We propose a novel framework for understanding multiplicative noise in neural networks, considering continuous distributions as well as Bernoulli noise (i.e. dropout). We show that multiplicative noise induces structured shrinkage priors on a network’s weights. We derive the equivalence through reparametrization properties of scale mixtures and without invoking any approximations. Given the equivalence, we then show that dropout’s Monte Carlo training objective approximates marginal MAP estimation. We leverage these insights to propose a novel shrinkage framework for resnets, terming the prior ‘automatic depth determination’ as it is the natural analog of automatic relevance determination for network depth. Lastly, we investigate two inference strategies that improve upon the aforementioned MAP approximation in regression benchmarks.

#### Bayesian deep CNNs with many channels are Gaussian processes

Roman Novak, Lechao Xiao, Yasaman Bahri, Jaehoon Lee, Greg Yang, Jiri Hron, Daniel A. Abolafia, Jeffrey Pennington, Jascha Sohl-Dickstein, 2019. (ICLR).

There is a previously identified equivalence between wide fully connected neural networks (FCNs) and Gaussian processes (GPs). This equivalence enables, for instance, test set predictions that would have resulted from a fully Bayesian, infinitely wide trained FCN to be computed without ever instantiating the FCN, but by instead evaluating the corresponding GP. In this work, we derive an analogous equivalence for multi-layer convolutional neural networks (CNNs) both with and without pooling layers, and achieve state of the art results on CIFAR10 for GPs without trainable kernels. We also introduce a Monte Carlo method to estimate the GP corresponding to a given neural network architecture, even in cases where the analytic form has too many terms to be computationally feasible. Surprisingly, in the absence of pooling layers, the GPs corresponding to CNNs with and without weight sharing are identical. As a consequence, translation equivariance, beneficial in finite channel CNNs trained with stochastic gradient descent (SGD), is guaranteed to play no role in the Bayesian treatment of the infinite channel limit—a qualitative difference between the two regimes that is not present in the FCN case. We confirm experimentally, that while in some scenarios the performance of SGD-trained finite CNNs approaches that of the corresponding GPs as the channel count increases, with careful tuning SGD-trained CNNs can significantly outperform their corresponding GPs, suggesting advantages from SGD training compared to fully Bayesian parameter estimation.

#### Benchmarking the neural linear model for regression

Sebastian W. Ober, Carl Edward Rasmussen, 2019. (In 2nd Symposium on Advances in Approximate Bayesian Inference).

The neural linear model is a simple adaptive Bayesian linear regression method that has recently been used in a number of problems ranging from Bayesian optimization to reinforcement learning. Despite its apparent successes in these settings, to the best of our knowledge there has been no systematic exploration of its capabilities on simple regression tasks. In this work we characterize these on the UCI datasets, a popular benchmark for Bayesian regression models, as well as on the recently introduced UCI “gap” datasets, which are better tests of out-of-distribution uncertainty. We demonstrate that the neural linear model is a simple method that shows generally good performance on these tasks, but at the cost of requiring good hyperparameter tuning.
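
The Bayesian linear regression step that the abstract refers to has a closed form, sketched below under stated assumptions. In a neural linear model the feature matrix would come from the last hidden layer of a trained network; here it is simply taken as given, and the function names and the two precision hyperparameters are illustrative choices rather than the authors' code.

```python
import numpy as np


def bayesian_linear_regression(features, targets, prior_precision, noise_precision):
    """Posterior mean and covariance of the weights of a Bayesian linear model.

    Assumes a zero-mean isotropic Gaussian prior on the weights and Gaussian
    observation noise (both precisions are hypothetical hyperparameters).
    """
    d = features.shape[1]
    precision = prior_precision * np.eye(d) + noise_precision * (features.T @ features)
    covariance = np.linalg.inv(precision)
    mean = noise_precision * (covariance @ features.T @ targets)
    return mean, covariance


def predict(features, mean, covariance, noise_precision):
    """Predictive mean and variance at new feature vectors."""
    pred_mean = features @ mean
    # Noise variance plus the weight uncertainty projected onto each feature vector.
    pred_var = 1.0 / noise_precision + np.sum((features @ covariance) * features, axis=1)
    return pred_mean, pred_var
```

The two precisions are exactly the kind of hyperparameters the abstract warns need careful tuning.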

#### Practical Deep Learning with Bayesian Principles

Kazuki Osawa, Siddharth Swaroop, Anirudh Jain, Runa Eschenhagen, Richard E. Turner, Rio Yokota, Mohammad Emtiyaz Khan, 2019. (In Advances in Neural Information Processing Systems 33).

Bayesian methods promise to fix many shortcomings of deep learning, but they are impractical and rarely match the performance of standard methods, let alone improve them. In this paper, we demonstrate practical training of deep networks with natural-gradient variational inference. By applying techniques such as batch normalisation, data augmentation, and distributed training, we achieve similar performance in about the same number of epochs as the Adam optimiser, even on large datasets such as ImageNet. Importantly, the benefits of Bayesian principles are preserved: predictive probabilities are well-calibrated, uncertainties on out-of-distribution data are improved, and continual-learning performance is boosted. This work enables practical deep learning while preserving benefits of Bayesian principles. A PyTorch implementation is available as a plug-and-play optimiser.

#### Bayesian batch active learning as sparse subset approximation

Robert Pinsler, Jonathan Gordon, Eric Nalisnick, José Miguel Hernández-Lobato, 2019. (In Advances in Neural Information Processing Systems 33).

Leveraging the wealth of unlabeled data produced in recent years provides great potential for improving supervised models. When the cost of acquiring labels is high, probabilistic active learning methods can be used to greedily select the most informative data points to be labeled. However, for many large-scale problems standard greedy procedures become computationally infeasible and suffer from negligible model change. In this paper, we introduce a novel Bayesian batch active learning approach that mitigates these issues. Our approach is motivated by approximating the complete data posterior of the model parameters. While naive batch construction methods result in correlated queries, our algorithm produces diverse batches that enable efficient active learning at scale. We derive interpretable closed-form solutions akin to existing active learning procedures for linear models, and generalize to arbitrary models using random projections. We demonstrate the benefits of our approach on several large-scale regression and classification tasks.

#### Factored Contextual Policy Search with Bayesian Optimization

Robert Pinsler, Peter Karkus, Andras Kupcsik, David Hsu, Wee Sun Lee, May 2019. (In IEEE International Conference on Robotics and Automation). Montreal, Canada.

Scarce data is a major challenge to scaling robot learning to truly complex tasks, as we need to generalize locally learned policies over different task contexts. Contextual policy search offers data-efficient learning and generalization by explicitly conditioning the policy on a parametric context space. In this paper, we further structure the contextual policy representation. We propose to factor contexts into two components: target contexts that describe the task objectives, e.g. target position for throwing a ball; and environment contexts that characterize the environment, e.g. initial position or mass of the ball. Our key observation is that experience can be directly generalized over target contexts. We show that this can be easily exploited in contextual policy search algorithms. In particular, we apply factorization to a Bayesian optimization approach to contextual policy search both in sampling-based and active learning settings. Our simulation results show faster learning and better generalization in various robotic domains. See our supplementary video: https://youtu.be/MNTbBAOufDY.

#### Fast and Flexible Multi-Task Classification using Conditional Neural Adaptive Processes

James Requeima, Jonathan Gordon, John Bronskill, Sebastian Nowozin, Richard E. Turner, 2019. (In Advances in Neural Information Processing Systems 33).

The goal of this paper is to design image classification systems that, after an initial multi-task training phase, can automatically adapt to new tasks encountered at test time. We introduce a conditional neural process based approach to the multi-task classification setting for this purpose, and establish connections to the meta- and few-shot learning literature. The resulting approach, called CNAPs, comprises a classifier whose parameters are modulated by an adaptation network that takes the current task’s dataset as input. We demonstrate that CNAPs achieves state-of-the-art results on the challenging Meta-Dataset benchmark indicating high-quality transfer-learning. We show that the approach is robust, avoiding both over-fitting in low-shot regimes and under-fitting in high-shot regimes. Timing experiments reveal that CNAPs is computationally efficient at test-time as it does not involve gradient based adaptation. Finally, we show that trained models are immediately deployable to continual learning and active learning where they can outperform existing approaches that do not leverage transfer learning.

#### The Gaussian Process Autoregressive Regression Model (GPAR)

James Requeima, William Tebbutt, Wessel Bruinsma, Richard E. Turner, 2019. (In 22nd International Conference on Artificial Intelligence and Statistics). Proceedings of Machine Learning Research.

Multi-output regression models must exploit dependencies between outputs to maximise predictive performance. The application of Gaussian processes (GPs) to this setting typically yields models that are computationally demanding and have limited representational power. We present the Gaussian Process Autoregressive Regression (GPAR) model, a scalable multi-output GP model that is able to capture nonlinear, possibly input-varying, dependencies between outputs in a simple and tractable way: the product rule is used to decompose the joint distribution over the outputs into a set of conditionals, each of which is modelled by a standard GP. GPAR’s efficacy is demonstrated on a variety of synthetic and real-world problems, outperforming existing GP models and achieving state-of-the-art performance on established benchmarks.

#### Orthogonal Estimation of Wasserstein Distances

Mark Rowland, Jiri Hron, Yunhao Tang, Krzysztof Choromanski, Tamas Sarlos, Adrian Weller, April 2019. (In 22nd International Conference on Artificial Intelligence and Statistics). Okinawa, Japan.

Wasserstein distances are increasingly used in a wide variety of applications in machine learning. Sliced Wasserstein distances form an important subclass which may be estimated efficiently through one-dimensional sorting operations. In this paper, we propose a new variant of sliced Wasserstein distance, study the use of orthogonal coupling in Monte Carlo estimation of Wasserstein distances and draw connections with stratified sampling, and evaluate our approaches experimentally in a range of large-scale experiments in generative modelling and reinforcement learning.
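
As a rough illustration of the "one-dimensional sorting operations" mentioned above, the sketch below estimates a sliced Wasserstein distance between two equal-size samples. It uses plain i.i.d. random directions, not the orthogonal coupling or the new variant studied in the paper, and all function names and defaults are illustrative assumptions.

```python
import numpy as np


def wasserstein_1d(u, v, p=2):
    """p-Wasserstein distance between two equal-size 1-D samples via sorting."""
    u_sorted = np.sort(u)
    v_sorted = np.sort(v)
    return np.mean(np.abs(u_sorted - v_sorted) ** p) ** (1.0 / p)


def sliced_wasserstein(x, y, n_projections=64, p=2, seed=0):
    """Monte Carlo estimate of the sliced p-Wasserstein distance.

    x, y: arrays of shape (n, d) with the same number of samples n.
    Directions are drawn i.i.d. uniformly on the unit sphere.
    """
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    directions = rng.standard_normal((n_projections, d))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # Project both samples onto every direction: shape (n_projections, n).
    x_proj = directions @ x.T
    y_proj = directions @ y.T
    per_slice = [wasserstein_1d(a, b, p) ** p for a, b in zip(x_proj, y_proj)]
    return np.mean(per_slice) ** (1.0 / p)
```

Each slice costs only a sort, which is what makes this subclass cheap compared with solving a full optimal transport problem.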

#### Formally justified and modular Bayesian inference for probabilistic programs

Adam Ścibior, 2019. University of Cambridge, Department of Engineering, Cambridge, UK.

Probabilistic modelling offers a simple and coherent framework to describe the real world in the face of uncertainty. Furthermore, by applying Bayes’ rule it is possible to use probabilistic models to make inferences about the state of the world from partial observations. While traditionally probabilistic models were constructed on paper, more recently the approach of probabilistic programming enables users to write the models in executable languages resembling computer programs and to freely mix them with deterministic code. It has long been recognised that the semantics of programming languages is complicated and the intuitive understanding that programmers have is often inaccurate, resulting in difficult to understand bugs and unexpected program behaviours. Programming languages are therefore studied in a rigorous way using formal languages with mathematically defined semantics. Traditionally formal semantics of probabilistic programs are defined using exact inference results, but in practice exact Bayesian inference is not tractable and approximate methods are used instead, posing a question of how the results of these algorithms relate to the exact results. Correctness of such approximate methods is usually argued somewhat less rigorously, without reference to a formal semantics. In this dissertation we formally develop denotational semantics for probabilistic programs that correspond to popular sampling algorithms often used in practice. The semantics is defined for an expressive typed lambda calculus with higher-order functions and inductive types, extended with probabilistic effects for sampling and conditioning, allowing continuous distributions and unbounded likelihoods. It makes crucial use of the recently developed formalism of quasi-Borel spaces to bring all these elements together. 
We provide semantics corresponding to several variants of Markov chain Monte Carlo and Sequential Monte Carlo methods and formally prove a notion of correctness for these algorithms in the context of probabilistic programming. We also show that the semantic construction can be directly mapped to an implementation using established functional programming abstractions called monad transformers. We develop a compact Haskell library for probabilistic programming closely corresponding to the semantic construction, giving users a high level of assurance in the correctness of the implementation. We also demonstrate on a collection of benchmarks that the library offers performance competitive with existing systems of similar scope. An important property of our construction, both the semantics and the implementation, is the high degree of modularity it offers. All the inference algorithms are constructed by combining small building blocks in a setup where the type system ensures correctness of compositions. We show that with basic building blocks corresponding to vanilla Metropolis-Hastings and Sequential Monte Carlo we can implement more advanced algorithms known in the literature, such as Resample-Move Sequential Monte Carlo, Particle Marginal Metropolis-Hastings, and Sequential Monte Carlo squared. These implementations are very concise, reducing the effort required to produce them and the scope for bugs. On top of that, our modular construction enables in some cases deterministic testing of randomised inference algorithms, further increasing reliability of the implementation.

#### Leader stochastic gradient descent (LSGD) for distributed training of deep learning models

Yunfei Teng, Wenbo Gao, Francois Chalus, Anna Choromanska, Donald Goldfarb, Adrian Weller, December 2019. (In Advances in Neural Information Processing Systems 33). Vancouver.

We consider distributed optimization under communication constraints for training deep learning models. We propose a new algorithm, whose parameter updates rely on two forces: a regular gradient step, and a corrective direction dictated by the currently best-performing worker (leader). Our method differs from the parameter-averaging scheme EASGD in a number of ways: (i) our objective formulation does not change the location of stationary points compared to the original optimization problem; (ii) we avoid convergence decelerations caused by pulling local workers descending to different local minima to each other (i.e. to the average of their parameters); (iii) our update by design breaks the curse of symmetry (the phenomenon of being trapped in poorly generalizing sub-optimal solutions in symmetric non-convex landscapes); and (iv) our approach is more communication efficient since it broadcasts only parameters of the leader rather than all workers. We provide theoretical analysis of the batch version of the proposed algorithm, which we call Leader Gradient Descent (LGD), and its stochastic variant (LSGD). Finally, we implement an asynchronous version of our algorithm and extend it to the multi-leader setting, where we form groups of workers, each represented by its own local leader (the best performer in a group), and update each worker with a corrective direction comprised of two attractive forces: one to the local, and one to the global leader (the best performer among all workers). The multi-leader setting is well-aligned with current hardware architecture, where local workers forming a group lie within a single computational node and different groups correspond to different nodes. For training convolutional neural networks, we empirically demonstrate that our approach compares favorably to state-of-the-art baselines.
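
The two forces described in the abstract, a regular gradient step and a corrective pull toward the currently best-performing worker, can be sketched on a toy problem as follows. This is a synchronous, full-gradient illustration only: the quadratic objective, the step size `lr`, and the pull strength `pull` are assumptions made for the example, and the asynchronous and multi-leader variants are not shown.

```python
import numpy as np


def loss(x):
    """Toy objective (hypothetical): a quadratic bowl with its minimum at 1."""
    return 0.5 * np.sum((x - 1.0) ** 2)


def grad(x):
    """Gradient of the toy objective."""
    return x - 1.0


def leader_gd(workers, steps=100, lr=0.1, pull=0.5):
    """Each worker takes a gradient step plus a pull toward the current leader."""
    workers = [w.copy() for w in workers]
    for _ in range(steps):
        # The leader is the worker with the lowest current loss.
        leader = min(workers, key=loss).copy()
        workers = [w - lr * grad(w) - lr * pull * (w - leader) for w in workers]
    return workers
```

Only the leader's parameters need to be broadcast at each step, which matches the communication argument in point (iv) of the abstract.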

#### Bayesian learning of sum-product networks

Martin Trapp, Robert Peharz, Hong Ge, Franz Pernkopf, Zoubin Ghahramani, December 2019. (In Advances in Neural Information Processing Systems 33). Vancouver.

Sum-product networks (SPNs) are flexible density estimators and have received significant attention due to their attractive inference properties. While parameter learning in SPNs is well developed, structure learning leaves something to be desired: Even though there is a plethora of SPN structure learners, most of them are somewhat ad-hoc and based on intuition rather than a clear learning principle. In this paper, we introduce a well-principled Bayesian framework for SPN structure learning. First, we decompose the problem into i) laying out a computational graph, and ii) learning the so-called scope function over the graph. The first is rather unproblematic and akin to neural network architecture validation. The second represents the effective structure of the SPN and needs to respect the usual structural constraints in SPN, i.e. completeness and decomposability. While representing and learning the scope function is somewhat involved in general, in this paper, we propose a natural parametrisation for an important and widely used special case of SPNs. These structural parameters are incorporated into a Bayesian model, such that simultaneous structure and parameter learning is cast into monolithic Bayesian posterior inference. In various experiments, our Bayesian SPNs often improve test likelihoods over greedy SPN learners. Further, since the Bayesian framework protects against overfitting, we can evaluate hyper-parameters directly on the Bayesian model score, waiving the need for a separate validation set, which is especially beneficial in low data regimes. Bayesian SPNs can be applied to heterogeneous domains and can easily be extended to nonparametric formulations. Moreover, our Bayesian approach is the first, which consistently and robustly learns SPN structures under missing data.

## 2018

#### Discovering interpretable representations for both deep generative and discriminative models

Tameem Adel, Zoubin Ghahramani, Adrian Weller, July 2018. (In 35th International Conference on Machine Learning). Stockholm, Sweden.

Interpretability of representations in both deep generative and discriminative models is highly desirable. Current methods jointly optimize an objective combining accuracy and interpretability. However, this may reduce accuracy, and is not applicable to already trained models. We propose two interpretability frameworks. First, we provide an interpretable lens for an existing model. We use a generative model which takes as input the representation in an existing (generative or discriminative) model, weakly supervised by limited side information. Applying a flexible and invertible transformation to the input leads to an interpretable representation with no loss in accuracy. We extend the approach using an active learning strategy to choose the most useful side information to obtain, allowing a human to guide what “interpretable” means. Our second framework relies on joint optimization for a representation which is both maximally informative about the side information and maximally compressive about the non-interpretable data factors. This leads to a novel perspective on the relationship between compression and regularization. We also propose a new interpretability evaluation metric based on our framework. Empirically, we achieve state-of-the-art results on three datasets using the two proposed algorithms.

#### Gauged Mini-Bucket Elimination for Approximate Inference

Sungsoo Ahn, Michael Chertkov, Jinwoo Shin, Adrian Weller, April 2018. (In 21st International Conference on Artificial Intelligence and Statistics). Playa Blanca, Lanzarote, Canary Islands.

Computing the partition function Z of a discrete graphical model is a fundamental inference challenge. Since this is computationally intractable, variational approximations are often used in practice. Recently, so-called gauge transformations were used to improve variational lower bounds on Z. In this paper, we propose a new gauge-variational approach, termed WMBE-G, which combines gauge transformations with the weighted mini-bucket elimination (WMBE) method. WMBE-G can provide both upper and lower bounds on Z, and is easier to optimize than the prior gauge-variational algorithm. We show that WMBE-G strictly improves the earlier WMBE approximation for symmetric models including Ising models with no magnetic field. Our experimental results demonstrate the effectiveness of WMBE-G even for generic, nonsymmetric models.

#### Bucket renormalization for approximate inference

Sungsoo Ahn, Michael Chertkov, Adrian Weller, Jinwoo Shin, 2018. (In 35th International Conference on Machine Learning).

Probabilistic graphical models are a key tool in machine learning applications. Computing the partition function, i.e., normalizing constant, is a fundamental task of statistical inference but it is generally computationally intractable, leading to extensive study of approximation methods. Iterative variational methods are a popular and successful family of approaches. However, even state of the art variational methods can return poor results or fail to converge on difficult instances. In this paper, we instead consider computing the partition function via sequential summation over variables. We develop robust approximate algorithms by combining ideas from mini-bucket elimination with tensor network and renormalization group methods from statistical physics. The resulting “convergence-free” methods show good empirical performance on both synthetic and real-world benchmark models, even for difficult instances.
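
To make "sequential summation over variables" concrete, the sketch below computes the partition function of a small chain-structured model exactly, by summing out one variable at a time, and checks it against brute-force enumeration. It covers only the exact, tractable chain case; the mini-bucket and renormalization approximations developed in the paper are not reproduced, and the function names are illustrative.

```python
import itertools

import numpy as np


def partition_function_chain(factors):
    """Exact Z for a chain model by summing out variables left to right.

    factors[i][a, b] is the non-negative potential on (x_i = a, x_{i+1} = b).
    """
    message = np.ones(factors[0].shape[0])
    for psi in factors:
        # Sum out the left variable of this factor.
        message = message @ psi
    return message.sum()


def partition_function_brute_force(factors):
    """Reference value obtained by enumerating every joint configuration."""
    k = factors[0].shape[0]
    n_vars = len(factors) + 1
    total = 0.0
    for config in itertools.product(range(k), repeat=n_vars):
        weight = 1.0
        for i, psi in enumerate(factors):
            weight *= psi[config[i], config[i + 1]]
        total += weight
    return total
```

On a chain the elimination cost grows linearly with the number of variables, whereas enumeration grows exponentially; on general graphs the intermediate messages become large tensors, which is where approximations are needed.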

#### Differentially Private Database Release via Kernel Mean Embeddings

Matej Balog, Ilya Tolstikhin, Bernhard Schölkopf, July 2018. (In 35th International Conference on Machine Learning). Stockholm, Sweden.

We lay theoretical foundations for new database release mechanisms that allow third-parties to construct consistent estimators of population statistics, while ensuring that the privacy of each individual contributing to the database is protected. The proposed framework rests on two main ideas. First, releasing (an estimate of) the kernel mean embedding of the data generating random variable instead of the database itself still allows third-parties to construct consistent estimators of a wide class of population statistics. Second, the algorithm can satisfy the definition of differential privacy by basing the released kernel mean embedding on entirely synthetic data points, while controlling accuracy through the metric available in a Reproducing Kernel Hilbert Space. We describe two instantiations of the proposed framework, suitable under different scenarios, and prove theoretical results guaranteeing differential privacy of the resulting algorithms and the consistency of estimators constructed from their outputs.

**Comment:** [arXiv]

#### Nonlinear Set Membership Regression with Adaptive Hyper-Parameter Estimation for Online Learning and Control

Jan-Peter Calliess, Stephen Roberts, Carl Edward Rasmussen, Jan Maciejowski, 2018. (In Proceedings of the European Control Conference).

Methods known as Lipschitz Interpolation or Nonlinear Set Membership regression have become established tools for nonparametric system-identification and data-based control. They utilise presupposed Lipschitz properties to compute inferences over unobserved function values. Unfortunately, they rely on the a priori knowledge of a Lipschitz constant of the underlying target function which serves as a hyperparameter. We propose a closed-form estimator of the Lipschitz constant that is robust to bounded observational noise in the data. The merger of Lipschitz Interpolation with the new hyperparameter estimator gives a new nonparametric machine learning method for which we derive online learning convergence guarantees. Furthermore, we apply our learning method to model-reference adaptive control and provide a convergence guarantee on the closed-loop dynamics. In a simulated flight manoeuvre control scenario, we compare the performance of our approach to recently proposed alternative learning-based controllers.
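
A minimal sketch of Lipschitz interpolation in its textbook, noise-free form is given below: the prediction is the midpoint between the tightest upper and lower bounds implied by the data and a Lipschitz constant. The constant is assumed known here; the closed-form estimator of the constant proposed in the paper is not reproduced, and the function name and signature are illustrative.

```python
import numpy as np


def lipschitz_interpolation(x_query, x_train, y_train, lipschitz_const):
    """Predict at x_query as the midpoint of the tightest Lipschitz bounds.

    x_train: (n, d), y_train: (n,), x_query: (m, d).
    Returns the prediction together with the lower and upper bounds.
    """
    # Pairwise Euclidean distances between queries and training inputs: (m, n).
    dists = np.linalg.norm(x_query[:, None, :] - x_train[None, :, :], axis=-1)
    upper = np.min(y_train[None, :] + lipschitz_const * dists, axis=1)
    lower = np.max(y_train[None, :] - lipschitz_const * dists, axis=1)
    return 0.5 * (upper + lower), lower, upper
```

If the supplied constant is too small, the bounds can cross and no longer bracket the target function, which is why the hyperparameter matters.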

#### The Geometry of Random Features

Krzysztof Choromanski, Mark Rowland, Tamas Sarlos, Vikas Sindhwani, Richard E. Turner, Adrian Weller, April 2018. (In 21st International Conference on Artificial Intelligence and Statistics). Playa Blanca, Lanzarote, Canary Islands.

We present an in-depth examination of the effectiveness of radial basis function kernel (beyond Gaussian) estimators based on orthogonal random feature maps. We show that orthogonal estimators outperform state-of-the-art mechanisms that use iid sampling under weak conditions for tails of the associated Fourier distributions. We prove that for the case of many dimensions, the superiority of the orthogonal transform can be accurately measured by a property we define called the charm of the kernel, and that orthogonal random features provide optimal (in terms of mean squared error) kernel estimators. We provide the first theoretical results which explain why orthogonal random features outperform unstructured on downstream tasks such as kernel ridge regression by showing that orthogonal random features provide kernel algorithms with better spectral properties than the previous state-of-the-art. Our results enable practitioners more generally to estimate the benefits from applying orthogonal transforms.
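
For readers unfamiliar with orthogonal random features, the sketch below builds them for the Gaussian kernel only, using one common construction: blocks of orthogonal directions from a QR decomposition, rescaled so that each row has the norm of an independent Gaussian vector. The function names, block construction, and unit bandwidth are assumptions for illustration and are not taken from the paper.

```python
import numpy as np


def orthogonal_gaussian_frequencies(dim, n_features, rng):
    """Stack blocks of mutually orthogonal rows with Gaussian-like norms."""
    blocks = []
    for _ in range(int(np.ceil(n_features / dim))):
        q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
        # Fix column signs so that q is uniformly distributed over orthogonal matrices.
        q = q * np.sign(np.diag(r))
        # Row norms distributed like those of independent Gaussian vectors.
        norms = np.sqrt(rng.chisquare(dim, size=dim))
        blocks.append(norms[:, None] * q)
    return np.vstack(blocks)[:n_features]


def random_feature_map(x, frequencies):
    """Features whose inner products estimate exp(-||x - y||^2 / 2)."""
    proj = x @ frequencies.T
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(frequencies.shape[0])
```

Replacing the frequency matrix with i.i.d. Gaussian rows gives the unstructured baseline that the abstract compares against.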

#### Structured evolution with compact architectures for scalable policy optimization

Krzysztof Choromanski, Mark Rowland, Vikas Sindhwani, Richard Turner, Adrian Weller, July 2018. (In 35th International Conference on Machine Learning). Stockholm, Sweden.

We present a new method of blackbox optimization via gradient approximation with the use of structured random orthogonal matrices, providing more accurate estimators than baselines and with provable theoretical guarantees. We show that this algorithm can be successfully applied to learn better quality compact policies than those using standard gradient estimation techniques. The compact policies we learn have several advantages over unstructured ones, including faster training algorithms and faster inference. These benefits are important when the policy is deployed on real hardware with limited resources. Further, compact policies provide more scalable architectures for derivative-free optimization (DFO) in high dimensional spaces. We show that most robotics tasks from the OpenAI Gym can be solved using neural networks with less than 300 parameters, with almost linear time complexity of the inference phase, with up to 13x fewer parameters relative to the Evolution Strategies (ES) algorithm introduced by Salimans et al. (2017). We do not need heuristics such as fitness shaping to learn good quality policies, resulting in a simple and theoretically motivated training mechanism.

#### Quantum machine learning: a classical perspective

Carlo Ciliberto, Mark Herbster, Alessandro Davide Ialongo, Massimiliano Pontil, Andrea Rocchetto, Simone Severini, Leonard Wossnig, 2018. (In Proc. R. Soc. A). **DOI**: 10.1098/rspa.2017.0551.

Recently, increased computational power and data availability, as well as algorithmic advances, have led machine learning techniques to impressive results in regression, classification, data-generation and reinforcement learning tasks. Despite these successes, the proximity to the physical limits of chip fabrication alongside the increasing size of datasets are motivating a growing number of researchers to explore the possibility of harnessing the power of quantum computation to speed-up classical machine learning algorithms. Here we review the literature in quantum machine learning and discuss perspectives for a mixed readership of classical machine learning and quantum computation experts. Particular emphasis will be placed on clarifying the limitations of quantum algorithms, how they compare with their best classical counterparts and why quantum resources are expected to provide advantages for learning problems. Learning in the presence of noise and certain computationally hard problems in machine learning are identified as promising directions for the field. Practical questions, like how to upload classical data into quantum form, will also be addressed.

#### Distributional Reinforcement Learning with Quantile Regression

Will Dabney, Mark Rowland, Marc G. Bellemare, Rémi Munos, February 2018. (In 32nd AAAI Conference on Artificial Intelligence). New Orleans.

In reinforcement learning an agent interacts with the environment by taking actions and observing the next state and reward. When sampled probabilistically, these state transitions, rewards, and actions can all induce randomness in the observed long-term return. Traditionally, reinforcement learning algorithms average over this randomness to estimate the value function. In this paper, we build on recent work advocating a distributional approach to reinforcement learning in which the distribution over returns is modeled explicitly instead of only estimating the mean. That is, we examine methods of learning the value distribution instead of the value function. We give results that close a number of gaps between the theoretical and algorithmic results given by Bellemare, Dabney, and Munos (2017). First, we extend existing results to the approximate distribution setting. Second, we present a novel distributional reinforcement learning algorithm consistent with our theoretical formulation. Finally, we evaluate this new algorithm on the Atari 2600 games, observing that it significantly outperforms many of the recent improvements on DQN, including the related distributional algorithm C51.
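
As a rough illustration of the quantile regression idea, the sketch below scores a set of quantile estimates against sampled return targets using the plain quantile (pinball) loss. It assumes quantile levels at the midpoints of equal-probability bins and omits any smoothing of the loss; the function names are hypothetical and this is not the paper's training code.

```python
def pinball_loss(u, tau):
    """Quantile regression loss rho_tau(u) = u * (tau - 1[u < 0])."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def quantile_midpoints(n):
    """Midpoints (2i + 1) / (2n), i = 0..n-1, of n equal-probability bins."""
    return [(2 * i + 1) / (2.0 * n) for i in range(n)]


def quantile_loss(thetas, targets):
    """Sum over quantile levels of the mean pinball loss against the targets."""
    taus = quantile_midpoints(len(thetas))
    total = 0.0
    for theta, tau in zip(thetas, taus):
        total += sum(pinball_loss(t - theta, tau) for t in targets) / len(targets)
    return total
```

Because each term is minimised when its estimate sits at the corresponding quantile of the targets, minimising this loss moves the estimates towards the quantiles of the return distribution rather than towards its mean.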

#### Gaussian Process Conditional Density Estimation

Vincent Dutordoir, Hugh Salimbeni, Marc Deisenroth, James Hensman, Dec 2018. (In Advances in Neural Information Processing Systems 31). Montréal, Canada.

Conditional Density Estimation (CDE) models deal with estimating conditional distributions. The conditions imposed on the distribution are the inputs of the model. CDE is a challenging task as there is a fundamental trade-off between model complexity, representational capacity and overfitting. In this work, we propose to extend the model’s input with latent variables and use Gaussian processes (GP) to map this augmented input onto samples from the conditional distribution. Our Bayesian approach allows for the modeling of small datasets, but we also provide the machinery for it to be applied to big data using stochastic variational inference. Our approach can be used to model densities even in sparse data regions, and allows for sharing learned structure between conditions. We illustrate the effectiveness and wide-reaching applicability of our model on a variety of real-world problems, such as spatio-temporal density estimation of taxi drop-offs, non-Gaussian noise modeling, and few-shot learning on Omniglot images.

#### Leave no Trace: Learning to Reset for Safe and Autonomous Reinforcement Learning

Benjamin Eysenbach, Shixiang Gu, Julian Ibarz, Sergey Levine, Apr 2018. (In 6th International Conference on Learning Representations). Vancouver, Canada.

Deep reinforcement learning algorithms can learn complex behavioral skills, but real-world application of these methods requires a large amount of experience to be collected by the agent. In practical settings, such as robotics, this involves repeatedly attempting a task, resetting the environment between each attempt. However, not all tasks are easily or automatically reversible. In practice, this learning process requires extensive human intervention. In this work, we propose an autonomous method for safe and efficient reinforcement learning that simultaneously learns a forward and reset policy, with the reset policy resetting the environment for a subsequent attempt. By learning a value function for the reset policy, we can automatically determine when the forward policy is about to enter a non-reversible state, providing for uncertainty-aware safety aborts. Our experiments illustrate that proper use of the reset policy can greatly reduce the number of manual resets required to learn a task, can reduce the number of unsafe actions that lead to non-reversible states, and can automatically induce a curriculum.

**Comment:** [Video]

#### Human perceptions of fairness in algorithmic decision making: A case study of criminal risk prediction

Nina Grgić-Hlača, Elissa Redmiles, Krishna P. Gummadi, Adrian Weller, April 2018. (In The Web Conference (WWW)). Lyon.

As algorithms are increasingly used to make important decisions that affect human lives, ranging from social benefit assignment to predicting risk of criminal recidivism, concerns have been raised about the fairness of algorithmic decision making. Most prior works on algorithmic fairness normatively prescribe how fair decisions ought to be made. In contrast, here, we descriptively survey users for how they perceive and reason about fairness in algorithmic decision making. A key contribution of this work is the framework we propose to understand why people perceive certain features as fair or unfair to be used in algorithms. Our framework identifies eight properties of features, such as relevance, volitionality and reliability, as latent considerations that inform people’s moral judgments about the fairness of feature use in decision-making algorithms. We validate our framework through a series of scenario-based surveys with 576 people. We find that, based on a person’s assessment of the eight latent properties of a feature in our exemplar scenario, we can accurately (> 85%) predict if the person will judge the use of the feature as fair. Our findings have important implications. At a high-level, we show that people’s unfairness concerns are multi-dimensional and argue that future studies need to address unfairness concerns beyond discrimination. At a low-level, we find considerable disagreements in people’s fairness judgments. We identify root causes of the disagreements, and note possible pathways to resolve them.

#### Beyond Distributive Fairness in Algorithmic Decision Making: Feature Selection for Procedurally Fair Learning

N. Grgić-Hlača, M. B. Zafar, K. P. Gummadi, A. Weller, February 2018. (In 32nd AAAI Conference on Artificial Intelligence). New Orleans.

With wide-spread usage of machine learning methods in numerous domains involving human subjects, several studies have raised questions about the potential for unfairness towards certain individuals or groups. A number of recent works have proposed methods to measure and eliminate unfairness from machine learning methods. However, most of this work on fair learning has focused on only one dimension of fair decision making: distributive fairness, i.e., the fairness of the decision outcomes. In this work, we leverage the rich literature on organizational justice and focus on another dimension of fair decision making: procedural fairness, i.e., the fairness of the decision making process. We propose measures for procedural fairness that consider the input features used in the decision process, and evaluate the moral judgments of humans regarding the use of these features. We operationalize these measures on two real world datasets using human surveys on the Amazon Mechanical Turk (AMT) platform, demonstrating that we capture important properties of procedurally fair decision making. We provide fast submodular mechanisms to optimize the tradeoff between procedural fairness and prediction accuracy. On our datasets, we observe empirically that procedural fairness may be achieved with little cost to outcome fairness, but that some loss of accuracy is unavoidable.

#### Variational Bayesian dropout: pitfalls and fixes

Jiri Hron, Alexander G. D. G. Matthews, Zoubin Ghahramani, 2018. (ICML).

Dropout, a stochastic regularisation technique for training of neural networks, has recently been reinterpreted as a specific type of approximate inference algorithm for Bayesian neural networks. The main contribution of the reinterpretation is in providing a theoretical framework useful for analysing and extending the algorithm. We show that the proposed framework suffers from several issues; from undefined or pathological behaviour of the true posterior related to use of improper priors, to an ill-defined variational objective due to singularity of the approximating distribution relative to the true posterior. Our analysis of the improper log uniform prior used in variational Gaussian dropout suggests the pathologies are generally irredeemable, and that the algorithm still works only because the variational formulation annuls some of the pathologies. To address the singularity issue, we proffer Quasi-KL (QKL) divergence, a new approximate inference objective for approximation of high-dimensional distributions. We show that motivations for variational Bernoulli dropout based on discretisation and noise have QKL as a limit. Properties of QKL are studied both theoretically and on a simple practical example which shows that the QKL-optimal approximation of a full rank Gaussian with a degenerate one naturally leads to the Principal Component Analysis solution.

#### Non-Factorised Variational Inference in Dynamical Systems

Alessandro Davide Ialongo, Mark van der Wilk, James Hensman, Carl Edward Rasmussen, December 2018. (In First Symposium on Advances in Approximate Bayesian Inference). Montreal.

We focus on variational inference in dynamical systems where the discrete time transition function (or evolution rule) is modelled by a Gaussian process. The dominant approach so far has been to use a factorised posterior distribution, decoupling the transition function from the system states. This is not exact in general and can lead to an overconfident posterior over the transition function as well as an overestimation of the intrinsic stochasticity of the system (process noise). We propose a new method that addresses these issues and incurs no additional computational costs.

#### Blind justice: Fairness with encrypted sensitive attributes

Niki Kilbertus, Adria Gascon, Matt Kusner, Michael Veale, Krishna P. Gummadi, Adrian Weller, July 2018. (In 35th International Conference on Machine Learning). Stockholm, Sweden.

Recent work has explored how to train machine learning models which do not discriminate against any subgroup of the population as determined by sensitive attributes such as gender or race. To avoid disparate treatment, sensitive attributes should not be considered. On the other hand, in order to avoid disparate impact, sensitive attributes must be examined — e.g., in order to learn a fair model, or to check if a given model is fair. We introduce methods from secure multi-party computation which allow us to avoid both. By encrypting sensitive attributes, we show how an outcome based fair model may be learned, checked, or have its outputs verified and held to account, without users revealing their sensitive attributes.
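
To give a flavour of how secure multi-party computation can keep sensitive attributes hidden, here is a minimal sketch of additive secret sharing, one basic building block of such protocols. It is an assumption for illustration only and not the protocol used in the paper; `PRIME`, `share` and `reconstruct` are hypothetical names.

```python
import secrets

PRIME = 2**61 - 1  # modulus for the shares (an arbitrary illustrative choice)


def share(value, n_parties=2):
    """Split an integer into n_parties random shares that sum to it mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares


def reconstruct(shares):
    """Recombine shares; any strict subset of them looks uniformly random."""
    return sum(shares) % PRIME


# Each user secret-shares a binary sensitive attribute between two parties.
attributes = [1, 0, 1, 1, 0]
shared = [share(a) for a in attributes]
# Each party adds up the shares it holds without seeing any attribute.
party_sums = [sum(s[p] for s in shared) % PRIME for p in range(2)]
group_count = reconstruct(party_sums)
```

Only the aggregate `group_count` is revealed when the two partial sums are combined, which is the kind of quantity a fairness check might need.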

#### Scalable Magnetic Field SLAM in 3D Using Gaussian Process Maps

Manon Kok, Arno Solin, July 2018. (In Proceedings of the 21st International Conference on Information Fusion (accepted for publication)). Cambridge, UK.

We present a method for scalable and fully 3D magnetic field simultaneous localisation and mapping (SLAM) using local anomalies in the magnetic field as a source of position information. These anomalies are due to the presence of ferromagnetic material in the structure of buildings and in objects such as furniture. We represent the magnetic field map using a Gaussian process model and take well-known physical properties of the magnetic field into account. We build local magnetic field maps using three-dimensional hexagonal block tiling. To make our approach computationally tractable we use reduced-rank Gaussian process regression in combination with a Rao–Blackwellised particle filter. We show that it is possible to obtain accurate position and orientation estimates using measurements from a smartphone, and that our approach provides a scalable magnetic SLAM algorithm in terms of both computational complexity and map storage.

#### Disentangled Sequential Autoencoder

Yingzhen Li, Stephan Mandt, July 2018. (In 35th International Conference on Machine Learning). Stockholm, Sweden.

We present a VAE architecture for encoding and generating high dimensional sequential data, such as video or audio. Our deep generative model learns a latent representation of the data which is split into a static and dynamic part, allowing us to approximately disentangle latent time-dependent features (dynamics) from features which are preserved over time (content). This architecture gives us partial control over generating content and dynamics by conditioning on either one of these sets of features. In our experiments on artificially generated cartoon video clips and voice recordings, we show that we can convert the content of a given sequence into another one by such content swapping. For audio, this allows us to convert a male speaker into a female speaker and vice versa, while for video we can separately manipulate shapes and dynamics. Furthermore, we give empirical evidence for the hypothesis that stochastic RNNs as latent state models are more efficient at compressing and generating long sequences than deterministic ones, which may be relevant for applications in video compression.

#### Gradient Estimators for Implicit Models

Yingzhen Li, Richard E. Turner, May 2018. (In Sixth International Conference on Learning Representations). Vancouver, Canada.

Implicit models, which allow for the generation of samples but not for point-wise evaluation of probabilities, are omnipresent in real-world problems tackled by machine learning and a hot topic of current research. Some examples include data simulators that are widely used in engineering and scientific research, generative adversarial networks (GANs) for image synthesis, and hot-off-the-press approximate inference techniques relying on implicit distributions. The majority of existing approaches to learning implicit models rely on approximating the intractable distribution or optimisation objective for gradient-based optimisation, which is liable to produce inaccurate updates and thus poor models. This paper alleviates the need for such approximations by proposing the Stein gradient estimator, which directly estimates the score function of the implicitly defined distribution. The efficacy of the proposed estimator is empirically demonstrated by examples that include meta-learning for approximate inference and entropy regularised GANs that provide improved sample diversities.

#### Antithetic and Monte Carlo kernel estimators for partial rankings

Maria Lomeli, Mark Rowland, Arthur Gretton, Zoubin Ghahramani, 2018. (arXiv preprint arXiv:1807.00400).

In the modern age, rankings data is ubiquitous and it is useful for a variety of applications such as recommender systems, multi-object tracking and preference learning. However, most rankings data encountered in the real world is incomplete, which prevents the direct application of existing modelling tools for complete rankings. Our contribution is a novel way to extend kernel methods for complete rankings to partial rankings, via consistent Monte Carlo estimators for Gram matrices: matrices of kernel values between pairs of observations. We also present a novel variance reduction scheme based on an antithetic variate construction between permutations to obtain an improved estimator for the Mallows kernel. The corresponding antithetic kernel estimator has lower variance and we demonstrate empirically that it has a better performance in a variety of Machine Learning tasks. Both kernel estimators are based on extending kernel mean embeddings to the embedding of a set of full rankings consistent with an observed partial ranking. They form a computationally tractable alternative to previous approaches for partial rankings data. An overview of the existing kernels and metrics for permutations is also provided.
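
The plain Monte Carlo estimator can be sketched as follows, under the assumptions that partial rankings are top-k lists and that the kernel between two partial rankings is the average Mallows kernel over full rankings consistent with each. The antithetic variant is not shown, and all function names are hypothetical.

```python
import itertools
import math
import random


def kendall_distance(rank_a, rank_b):
    """Number of item pairs that the two full rankings order differently."""
    pos_a = {item: i for i, item in enumerate(rank_a)}
    pos_b = {item: i for i, item in enumerate(rank_b)}
    return sum(
        1
        for x, y in itertools.combinations(rank_a, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )


def mallows_kernel(rank_a, rank_b, lam=0.5):
    """Mallows kernel exp(-lam * Kendall distance) between full rankings."""
    return math.exp(-lam * kendall_distance(rank_a, rank_b))


def sample_consistent(top_k, items, rng):
    """Uniformly sample a full ranking whose first entries equal top_k."""
    rest = [item for item in items if item not in top_k]
    rng.shuffle(rest)
    return list(top_k) + rest


def mc_partial_kernel(top_a, top_b, items, n_samples=200, lam=0.5, seed=0):
    """Average the Mallows kernel over sampled pairs of consistent rankings."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        total += mallows_kernel(
            sample_consistent(top_a, items, rng),
            sample_consistent(top_b, items, rng),
            lam,
        )
    return total / n_samples
```

When both inputs are already full rankings there is nothing left to sample, so the estimator reduces to the exact Mallows kernel.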

#### Gaussian process behaviour in wide deep neural networks

Alexander G. D. G. Matthews, Jiri Hron, Mark Rowland, Richard E. Turner, Zoubin Ghahramani, 2018. (ICLR).

Whilst deep neural networks have shown great empirical success, there is still much work to be done to understand their theoretical properties. In this paper, we study the relationship between random, wide, fully connected, feedforward networks with more than one hidden layer and Gaussian processes with a recursive kernel definition. We show that, under broad conditions, as we make the architecture increasingly wide, the implied random function converges in distribution to a Gaussian process, formalising and extending existing results by Neal (1996) to deep networks. To evaluate convergence rates empirically, we use maximum mean discrepancy. We then compare finite Bayesian deep networks from the literature to Gaussian processes in terms of the key predictive quantities of interest, finding that in some cases the agreement can be very close. We discuss the desirability of Gaussian process behaviour and review non-Gaussian alternative models from the literature.
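
For readers unfamiliar with maximum mean discrepancy, the following is a minimal sketch of the biased sample estimate of squared MMD between two sets of points. The RBF kernel and its lengthscale are assumptions made for illustration; the abstract does not say which kernel the authors used.

```python
import math


def rbf(x, y, lengthscale=1.0):
    """Squared-exponential kernel between two points given as sequences."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * lengthscale ** 2))


def mmd_squared(xs, ys, lengthscale=1.0):
    """Biased estimate: mean k(x, x') + mean k(y, y') - 2 * mean k(x, y)."""
    def mean_k(a, b):
        return sum(rbf(p, q, lengthscale) for p in a for q in b) / (len(a) * len(b))
    return mean_k(xs, xs) + mean_k(ys, ys) - 2.0 * mean_k(xs, ys)
```

In a convergence study of this kind, `xs` could hold outputs of randomly initialised networks of a given width and `ys` draws from the limiting Gaussian process, with the estimate expected to shrink as the width grows.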

#### Variational Continual Learning

Cuong V. Nguyen, Yingzhen Li, Thang D. Bui, Richard E. Turner, May 2018. (In Sixth International Conference on Learning Representations). Vancouver, Canada.

This paper develops variational continual learning (VCL), a simple but general framework for continual learning that fuses online variational inference (VI) and recent advances in Monte Carlo VI for neural networks. The framework can successfully train both deep discriminative models and deep generative models in complex continual learning settings where existing tasks evolve over time and entirely new tasks emerge. Experimental results show that variational continual learning outperforms state-of-the-art continual learning methods on a variety of tasks, avoiding catastrophic forgetting in a fully automatic way.

#### Learning Independent Causal Mechanisms

Giambattista Parascandolo, Niki Kilbertus, Mateo Rojas-Carulla, Bernhard Schölkopf, July 2018. (In 35th International Conference on Machine Learning). Stockholm, Sweden.

Statistical learning relies upon data sampled from a distribution, and we usually do not care what actually generated it in the first place. From the point of view of causal modeling, the structure of each distribution is induced by physical mechanisms that give rise to dependences between observables. Mechanisms, however, can be meaningful autonomous modules of generative models that make sense beyond a particular entailed data distribution, lending themselves to transfer between problems. We develop an algorithm to recover a set of independent (inverse) mechanisms from a set of transformed data points. The approach is unsupervised and based on a set of experts that compete for data generated by the mechanisms, driving specialization. We analyze the proposed method in a series of experiments on image data. Each expert learns to map a subset of the transformed data back to a reference distribution. The learned mechanisms generalize to novel domains. We discuss implications for transfer learning and links to recent trends in generative modeling.

#### PIPPS: Flexible Model-Based Policy Search Robust to the Curse of Chaos

Paavo Parmas, Carl Edward Rasmussen, Jan Peters, Kenji Doya, 2018. (In 35th International Conference on Machine Learning).

Previously, the exploding gradient problem has been explained to be central in deep learning and model-based reinforcement learning, because it causes numerical issues and instability in optimization. Our experiments in model-based reinforcement learning imply that the problem is not just a numerical issue, but it may be caused by a fundamental chaos-like nature of long chains of nonlinear computations. Not only do the magnitudes of the gradients become large, the direction of the gradients becomes essentially random. We show that reparameterization gradients suffer from the problem, while likelihood ratio gradients are robust. Using our insights, we develop a model-based policy search framework, Probabilistic Inference for Particle-Based Policy Search (PIPPS), which is easily extensible, and allows for almost arbitrary models and policies, while simultaneously matching the performance of previous data-efficient learning algorithms. Finally, we invent the total propagation algorithm, which efficiently computes a union over all pathwise derivative depths during a single backwards pass, automatically giving greater weight to estimators with lower variance, sometimes improving over reparameterization gradients by $10^6$ times.

#### Sample and Feedback Efficient Hierarchical Reinforcement Learning from Human Preferences

Robert Pinsler, Riad Akrour, Takayuki Osa, Jan Peters, Gerhard Neumann, May 2018. (In IEEE International Conference on Robotics and Automation). Brisbane, Australia.

While reinforcement learning has led to promising results in robotics, defining an informative reward function is challenging. Prior work considered including the human in the loop to jointly learn the reward function and the optimal policy. Generating samples from a physical robot and requesting human feedback are both taxing efforts for which efficiency is critical. We propose to learn reward functions from both the robot and the human perspectives to improve on both efficiency metrics. Learning a reward function from the human perspective increases feedback efficiency by assuming that humans rank trajectories according to a low-dimensional outcome space. Learning a reward function from the robot perspective circumvents the need for a dynamics model while retaining the sample efficiency of model-based approaches. We provide an algorithm that incorporates bi-perspective reward learning into a general hierarchical reinforcement learning framework and demonstrate the merits of our approach on a toy task and a simulated robot grasping task.

#### Temporal Difference Models: Model-Free Deep RL for Model-Based Control

Vitchyr Pong, Shixiang Gu, Murtaza Dalal, Sergey Levine, Apr 2018. (In 6th International Conference on Learning Representations). Vancouver, Canada.

Model-free reinforcement learning (RL) has been proven to be a powerful, general tool for learning complex behaviors. However, its sample complexity is often impractically large for solving challenging real-world problems, even for off-policy algorithms such as Q-learning. A limiting factor in classic model-free RL is that the learning signal consists only of scalar rewards, ignoring much of the rich information contained in state transition tuples. Model-based RL uses this information, by training a predictive model, but often does not achieve the same asymptotic performance as model-free RL due to model bias. We introduce temporal difference models (TDMs), a family of goal-conditioned value functions that can be trained with model-free learning and used for model-based control. TDMs combine the benefits of model-free and model-based RL: they leverage the rich information in state transitions to learn very efficiently, while still attaining asymptotic performance that exceeds that of direct model-based RL methods. Our experimental results show that, on a range of continuous control tasks, TDMs provide a substantial improvement in efficiency compared to state-of-the-art model-based and model-free methods.

#### An Analysis of Categorical Distributional Reinforcement Learning

Mark Rowland, Marc G. Bellemare, Will Dabney, Rémi Munos, Yee Whye Teh, April 2018. (In 21st International Conference on Artificial Intelligence and Statistics). Playa Blanca, Lanzarote, Canary Islands.

Distributional approaches to value-based reinforcement learning model the entire distribution of returns, rather than just their expected values, and have recently been shown to yield state-of-the-art empirical performance. This was demonstrated by the recently proposed C51 algorithm, based on categorical distributional reinforcement learning (CDRL) [Bellemare et al., 2017]. However, the theoretical properties of CDRL algorithms are not yet well understood. In this paper, we introduce a framework to analyse CDRL algorithms, establish the importance of the projected distributional Bellman operator in distributional RL, draw fundamental connections between CDRL and the Cramér distance, and give a proof of convergence for sample-based categorical distributional reinforcement learning algorithms.
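
The projection step at the heart of categorical distributional RL can be sketched as below. This follows the commonly described C51-style projection onto a fixed, evenly spaced support and is meant as an illustration under that assumption, not as the paper's reference implementation; `categorical_projection` is a hypothetical name.

```python
import math


def categorical_projection(probs, atoms, reward, gamma):
    """Project the distribution of reward + gamma * Z onto evenly spaced atoms."""
    v_min, v_max = atoms[0], atoms[-1]
    delta = (v_max - v_min) / (len(atoms) - 1)
    projected = [0.0] * len(atoms)
    for p, z in zip(probs, atoms):
        # Shift and scale the atom, then clip it to the support.
        tz = min(max(reward + gamma * z, v_min), v_max)
        pos = (tz - v_min) / delta
        lower = int(math.floor(pos))
        upper = min(lower + 1, len(atoms) - 1)
        frac = pos - lower
        # Split the mass between the two neighbouring atoms.
        projected[lower] += p * (1.0 - frac)
        projected[upper] += p * frac
    return projected
```

For example, with atoms `[0, 1, 2]`, all mass on the middle atom, `reward=0.5` and `gamma=1.0`, the shifted atom lands at 1.5 and its mass is split equally between the atoms at 1 and 2, which preserves the mean whenever no clipping occurs.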

#### Geometrically coupled Monte Carlo sampling

Mark Rowland, Krzysztof Choromanski, Francois Chalus, Aldo Pacchiano, Tamas Sarlos, Richard Turner, Adrian Weller, December 2018. (In Advances in Neural Information Processing Systems 31). Montreal, Canada.

Monte Carlo sampling in high-dimensional, low-sample settings is important in many machine learning tasks. We improve current methods for sampling in Euclidean spaces by avoiding independence, and instead consider ways to couple samples. We show fundamental connections to optimal transport theory, leading to novel sampling algorithms, and providing new theoretical grounding for existing strategies. We compare our new strategies against prior methods for improving sample efficiency, including quasi-Monte Carlo, by studying discrepancy. We explore our findings empirically, and observe benefits of our sampling schemes for reinforcement learning and generative modelling.
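
A simple way to see why coupling samples can help is antithetic sampling, a classical coupling used here purely as a stand-in; the paper's new algorithms are not reproduced. The sketch compares an i.i.d. estimator of an expectation under a standard normal with one that pairs each draw `x` with `-x`. The function names are hypothetical.

```python
import random


def iid_estimate(f, n_pairs, rng):
    """Average f over 2 * n_pairs independent standard normal draws."""
    samples = [rng.gauss(0.0, 1.0) for _ in range(2 * n_pairs)]
    return sum(f(x) for x in samples) / (2 * n_pairs)


def antithetic_estimate(f, n_pairs, rng):
    """Average f over n_pairs coupled draws (x, -x); same marginals, same cost."""
    total = 0.0
    for _ in range(n_pairs):
        x = rng.gauss(0.0, 1.0)
        total += f(x) + f(-x)
    return total / (2 * n_pairs)
```

Both estimators are unbiased because `-x` has the same distribution as `x`. For an integrand that is linear in `x` the coupled pairs cancel the noise entirely, while for general integrands the benefit depends on how the function behaves under the coupling.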

#### Functional programming for modular Bayesian inference

Adam Ścibior, Ohad Kammar, Zoubin Ghahramani, 2018. (Proceedings of the ACM on Programming Languages).

We present an architectural design of a library for Bayesian modelling and inference in modern functional programming languages. The novel aspect of our approach is the modular implementation of existing state-of-the-art inference algorithms. Our design relies on three inherently functional features: higher-order functions, inductive data-types, and support for either type-classes or an expressive module system. We provide a performant Haskell implementation of this architecture, demonstrating that high-level and modular probabilistic programming can be added as a library in sufficiently expressive languages. We review the core abstractions in this architecture: inference representations, inference transformations, and inference representation transformers. We then implement concrete instances of these abstractions, counterparts to particle filters and Metropolis-Hastings samplers, which form the basic building blocks of our library. By composing these building blocks we obtain state-of-the-art inference algorithms: Resample-Move Sequential Monte Carlo, Particle Marginal Metropolis-Hastings, and Sequential Monte Carlo Squared. We evaluate our implementation against existing probabilistic programming systems and find it is already competitively performant, although we conjecture that existing functional programming optimisation techniques could reduce the overhead associated with the abstractions we use. We show that our modular design enables deterministic testing of inherently stochastic Monte Carlo algorithms. Finally, we demonstrate using OCaml that an expressive module system can also implement our design.

#### Denotational Validation of Higher-Order Bayesian Inference

Adam Ścibior, Ohad Kammar, Matthijs Vákár, Sam Staton, Hongseok Yang, Yufei Cai, Klaus Ostermann, Sean K. Moss, Chris Heunen, Zoubin Ghahramani, 2018. (Proceedings of the ACM on Programming Languages).

We present a modular semantic account of Bayesian inference algorithms for probabilistic programming languages, as used in data science and machine learning. Sophisticated inference algorithms are often explained in terms of composition of smaller parts. However, neither their theoretical justification nor their implementation reflects this modularity. We show how to conceptualise and analyse such inference algorithms as manipulating intermediate representations of probabilistic programs using higher-order functions and inductive types, and their denotational semantics. Semantic accounts of continuous distributions use measurable spaces. However, our use of higher-order functions presents a substantial technical difficulty: it is impossible to define a measurable space structure over the collection of measurable functions between arbitrary measurable spaces that is compatible with standard operations on those functions, such as function application. We overcome this difficulty using quasi-Borel spaces, a recently proposed mathematical structure that supports both function spaces and continuous distributions. We define a class of semantic structures for representing probabilistic programs, and semantic validity criteria for transformations of these representations in terms of distribution preservation. We develop a collection of building blocks for composing representations. We use these building blocks to validate common inference algorithms such as Sequential Monte Carlo and Markov Chain Monte Carlo. To emphasize the connection between the semantic manipulation and its traditional measure theoretic origins, we use Kock’s synthetic measure theory. We demonstrate its usefulness by proving a quasi-Borel counterpart to the Metropolis-Hastings-Green theorem.

#### A Unified Approach to Quantifying Algorithmic Unfairness: Measuring Individual and Group Unfairness via Inequality Indices

Till Speicher, Hoda Heidari, Nina Grgić-Hlača, Krishna P. Gummadi, Adish Singla, Adrian Weller, Muhammad Bilal Zafar, 2018. (In KDD).

Discrimination via algorithmic decision making has received considerable attention. Prior work largely focuses on defining conditions for fairness, but does not define satisfactory measures of algorithmic unfairness. In this paper, we focus on the following question: Given two unfair algorithms, how should we determine which of the two is more unfair? Our core idea is to use existing inequality indices from economics to measure how unequally the outcomes of an algorithm benefit different individuals or groups in a population. Our work offers a justified and general framework to compare and contrast the (un)fairness of algorithmic predictors. This unifying approach enables us to quantify unfairness both at the individual and the group level. Further, our work reveals overlooked tradeoffs between different fairness notions: using our proposed measures, the overall individual-level unfairness of an algorithm can be decomposed into a between-group and a within-group component. Earlier methods are typically designed to tackle only between-group unfairness, which may be justified for legal or other reasons. However, we demonstrate that minimizing exclusively the between-group component may, in fact, increase the within-group, and hence the overall unfairness. We characterize and illustrate the tradeoffs between our measures of (un)fairness and the prediction accuracy.
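
The between-group and within-group decomposition mentioned above can be illustrated with the generalized entropy index, a standard additively decomposable inequality index from economics. The benefit definition in the usage lines (prediction minus label plus one) and the choice `alpha=2` are assumptions made for illustration, and the function names are hypothetical.

```python
def generalized_entropy(benefits, alpha=2.0):
    """Generalized entropy index for alpha not in {0, 1} and a positive mean."""
    n = len(benefits)
    mu = sum(benefits) / n
    return sum((b / mu) ** alpha - 1.0 for b in benefits) / (n * alpha * (alpha - 1.0))


def decompose(benefits, groups, alpha=2.0):
    """Return (between, within) components; every group mean must be positive."""
    n = len(benefits)
    mu = sum(benefits) / n
    by_group = {}
    for b, g in zip(benefits, groups):
        by_group.setdefault(g, []).append(b)
    within = 0.0
    group_means = []
    for members in by_group.values():
        mu_g = sum(members) / len(members)
        weight = (len(members) / n) * (mu_g / mu) ** alpha
        within += weight * generalized_entropy(members, alpha)
        # Between-group term: everyone receives their group's mean benefit.
        group_means.extend([mu_g] * len(members))
    between = generalized_entropy(group_means, alpha)
    return between, within


y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0, 1]
benefits = [p - t + 1 for p, t in zip(y_pred, y_true)]
between, within = decompose(benefits, ["a", "a", "a", "b", "b", "b"])
```

The two components sum to the index computed on the whole population, so shrinking only the between-group term leaves room for the within-group term, and hence the total, to grow.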

#### Sum-Product Autoencoding: Encoding and Decoding Representations using Sum-Product Networks

Antonio Vergari, Robert Peharz, Nicola Di Mauro, Alejandro Molina, Kristian Kersting, Floriana Esposito, February 2018. (In 32nd AAAI Conference on Artificial Intelligence). New Orleans, USA.

Sum-Product Networks (SPNs) are a deep probabilistic architecture that up to now has been successfully employed for tractable inference. Here, we extend their scope towards unsupervised representation learning: we encode samples into continuous and categorical embeddings and show that they can also be decoded back into the original input space by leveraging MPE inference. We characterize when this Sum-Product Autoencoding (SPAE) leads to equivalent reconstructions and extend it towards dealing with missing embedding information. Our experimental results on several multilabel classification problems demonstrate that SPAE is competitive with state-of-the-art autoencoder architectures, even if the SPNs were never trained to reconstruct their inputs.

#### Turing: A Language for Flexible Probabilistic Inference

Hong Ge, Kai Xu, Zoubin Ghahramani, 09–11 Apr 2018. (In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics). Edited by Amos Storkey, Fernando Perez-Cruz. PMLR. Proceedings of Machine Learning Research.

Probabilistic programming promises to simplify and democratize probabilistic machine learning, but successful probabilistic programming systems require flexible, generic and efficient inference engines. In this work, we present a system called Turing for building MCMC algorithms for probabilistic programming inference. Turing has a very simple syntax and makes full use of the numerical capabilities in the Julia programming language, including all implemented probability distributions, and automatic differentiation. Turing supports a wide range of popular Monte Carlo algorithms, including Hamiltonian Monte Carlo (HMC), HMC with No-U-Turns (NUTS), Gibbs sampling, sequential Monte Carlo (SMC), and several particle MCMC (PMCMC) samplers. Most importantly, Turing inference is composable: it combines MCMC operations on subsets of variables, for example using a combination of an HMC engine and a particle Gibbs (PG) engine. We explore several combinations of inference methods with the aim of finding approaches that are both efficient and universal, i.e. applicable to arbitrary probabilistic models. NUTS, a popular variant of HMC that adapts the Hamiltonian simulation path length automatically, is quite powerful for exploring differentiable target distributions but is not universal. We identify some failure modes for the NUTS engine, and demonstrate that composition of PG (for discrete variables) and NUTS (for continuous variables) can be useful when the NUTS engine is either not applicable, or simply does not work well. Our aim is to present Turing and its composable inference engines to the world and encourage other researchers to build on this system to help advance the field of probabilistic machine learning.

## 2017

#### DeepCoder: Learning to Write Programs

Matej Balog, Alexander L. Gaunt, Marc Brockschmidt, Sebastian Nowozin, Daniel Tarlow, April 2017. (In 5th International Conference on Learning Representations). Toulon, France.

We develop a first line of attack for solving programming competition-style problems from input-output examples using deep learning. The approach is to train a neural network to predict properties of the program that generated the outputs from the inputs. We use the neural network’s predictions to augment search techniques from the programming languages community, including enumerative search and an SMT-based solver. Empirically, we show that our approach leads to an order of magnitude speedup over the strong non-augmented baselines and a Recurrent Neural Network approach, and that we are able to solve problems of difficulty comparable to the simplest problems on programming competition websites.

#### Lost Relatives of the Gumbel Trick

Matej Balog, Nilesh Tripuraneni, Zoubin Ghahramani, Adrian Weller, August 2017. (In 34th International Conference on Machine Learning). Sydney, Australia.

The Gumbel trick is a method to sample from a discrete probability distribution, or to estimate its normalizing partition function. The method relies on repeatedly applying a random perturbation to the distribution in a particular way, each time solving for the most likely configuration. We derive an entire family of related methods, of which the Gumbel trick is one member, and show that the new methods have superior properties in several settings with minimal additional computational cost. In particular, for the Gumbel trick to yield computational benefits for discrete graphical models, Gumbel perturbations on all configurations are typically replaced with so-called low-rank perturbations. We show how a subfamily of our new methods adapts to this setting, proving new upper and lower bounds on the log partition function and deriving a family of sequential samplers for the Gibbs distribution. Finally, we balance the discussion by showing how the simpler analytical form of the Gumbel trick enables additional theoretical results.
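The basic Gumbel trick summarised in this abstract can be sketched in a few lines. This is a minimal illustration that perturbs every configuration of a tiny distribution (the full-rank case, not the low-rank perturbations the paper studies); the log-potentials, sample count, and function names are assumptions chosen for the example.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329  # mean of a standard Gumbel variate


def sample_gumbel(rng):
    """Draw a standard Gumbel variate by inverse-CDF sampling."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))


def perturb_and_max(log_potentials, rng):
    """Return (argmax, max) of the Gumbel-perturbed log-potentials."""
    perturbed = [phi + sample_gumbel(rng) for phi in log_potentials]
    best = max(range(len(perturbed)), key=perturbed.__getitem__)
    return best, perturbed[best]


def gumbel_trick(log_potentials, n_samples, seed=0):
    """Estimate sampling frequencies and log Z from repeated perturb-and-max."""
    rng = random.Random(seed)
    counts = [0] * len(log_potentials)
    total_max = 0.0
    for _ in range(n_samples):
        idx, value = perturb_and_max(log_potentials, rng)
        counts[idx] += 1
        total_max += value
    freqs = [c / n_samples for c in counts]
    # The perturbed maximum is Gumbel-distributed with location log Z.
    log_z_estimate = total_max / n_samples - EULER_GAMMA
    return freqs, log_z_estimate


log_potentials = [0.0, 1.0, 2.0]  # hypothetical unnormalised log-probabilities
freqs, log_z_hat = gumbel_trick(log_potentials, n_samples=20000)
log_z_true = math.log(sum(math.exp(phi) for phi in log_potentials))
```

The argmax indices are samples from the normalised distribution, and the average of the perturbed maxima, shifted by the Euler-Mascheroni constant, estimates the log partition function.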

#### Mapping Intelligence: Requirements and Possibilities

Sankalp Bhatnagar, Anna Alexandrova, Shahar Avin, Stephen Cave, Lucy Cheke, Matthew Crosby, Jan Feyereisl, Marta Halina, Bao Sheng Loe, Seán Ó hÉigeartaigh, Fernando Martínez-Plumed, Huw Price, Henry Shevlin, Adrian Weller, Alan Winfield, Jose Hernandez-Orallo, 2017. (In Philosophy and Theory of Artificial Intelligence (PT-AI)).

New types of artificial intelligence (AI), from cognitive assistants to social robots, are challenging meaningful comparison with other kinds of intelligence. How can such intelligent systems be catalogued, evaluated, and contrasted, with representations and projections that offer meaningful insights? To catalyse the research in AI and the future of cognition, we present the motivation, requirements and possibilities for an atlas of intelligence: an integrated framework and collaborative open repository for collecting and exhibiting information of all kinds of intelligence, including humans, non-human animals, AI systems, hybrids and collectives thereof. After presenting this initiative, we review related efforts and present the requirements of such a framework. We survey existing visualisations and representations, and discuss which criteria of inclusion should be used to configure an atlas of intelligence.

#### Sampling and inference for discrete random probability measures in probabilistic programs

Ben Bloem-Reddy, Emile Mathieu, Adam Foster, Tom Rainforth, Hong Ge, Maria Lomeli, Zoubin Ghahramani, December 2017. (In NIPS workshop on Advances in Approximate Inference). California, United States.

We consider the problem of sampling a sequence from a discrete random probability measure (RPM) with countable support, under (probabilistic) constraints of finite memory and computation. A canonical example is sampling from the Dirichlet Process, which can be accomplished using its well-known stick-breaking representation and lazy initialization of its atoms. We show that efficient lazy initialization is possible if and only if a size-biased representation of the discrete RPM is known. For models constructed from such discrete RPMs, we consider the implications for generic particle-based inference methods in probabilistic programming systems. To demonstrate, we implement posterior inference for Normalized Inverse Gaussian Process mixture models in Turing.

#### Streaming sparse Gaussian process approximations

Thang D. Bui, Cuong V. Nguyen, Richard E. Turner, December 2017. (In Advances in Neural Information Processing Systems 30). Long Beach, California, USA.

Sparse approximations for Gaussian process models provide a suite of methods that enable these models to be deployed in the large data regime and enable analytic intractabilities to be sidestepped. However, the field lacks a principled method to handle streaming data in which the posterior distribution over function values and the hyperparameters are updated in an online fashion. The small number of existing approaches either use suboptimal hand-crafted heuristics for hyperparameter learning, or suffer from catastrophic forgetting or slow updating when new data arrive. This paper develops a new principled framework for deploying Gaussian process probabilistic models in the streaming setting, providing principled methods for learning hyperparameters and optimising pseudo-input locations. The proposed framework is experimentally validated using synthetic and real-world datasets.

**Comment:** The first two authors contributed equally.

#### A Unifying Framework for Gaussian Process Pseudo-Point Approximations using Power Expectation Propagation

Thang D. Bui, Josiah Yan, Richard E. Turner, 2017. (Journal of Machine Learning Research).

Gaussian processes (GPs) are flexible distributions over functions that enable high-level assumptions about unknown functions to be encoded in a parsimonious, flexible and general way. Although elegant, the application of GPs is limited by computational and analytical intractabilities that arise when data are sufficiently numerous or when employing non-Gaussian models. Consequently, a wealth of GP approximation schemes have been developed over the last 15 years to address these key limitations. Many of these schemes employ a small set of pseudo data points to summarise the actual data. In this paper we develop a new pseudo-point approximation framework using Power Expectation Propagation (Power EP) that unifies a large number of these pseudo-point approximations. Unlike much of the previous venerable work in this area, the new framework is built on standard methods for approximate inference (variational free-energy, EP and Power EP methods) rather than employing approximations to the probabilistic generative model itself. In this way all of the approximation is performed at 'inference time' rather than at 'modelling time', resolving awkward philosophical and empirical questions that trouble previous approaches. Crucially, we demonstrate that the new framework includes new pseudo-point approximation methods that outperform current approaches on regression and classification tasks.

#### Lipschitz Optimisation for Lipschitz Interpolation

Jan-Peter Calliess, May 2017. (In 2017 American Control Conference (ACC 2017)). Seattle, WA, USA.

Techniques known as Nonlinear Set Membership prediction, Kinky Inference or Lipschitz Interpolation are fast and numerically robust approaches to nonparametric machine learning that have been proposed to be utilised in the context of system identification and learning-based control. They utilise presupposed Lipschitz properties in order to compute inferences over unobserved function values. Unfortunately, most of these approaches rely on exact knowledge about the input space metric as well as about the Lipschitz constant. Furthermore, existing techniques to estimate the Lipschitz constants from the data are not robust to noise or seem to be ad-hoc and typically are decoupled from the ultimate learning and prediction task. To overcome these limitations, we propose an approach for optimising parameters of the presupposed metrics by minimising validation set prediction errors. To avoid poor performance due to local minima, we propose to utilise Lipschitz properties of the optimisation objective to ensure global optimisation success. The resulting approach is a new flexible method for nonparametric black-box learning. We illustrate its competitiveness on a set of benchmark problems.
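For orientation, the basic Lipschitz interpolation predictor that this line of work builds on can be sketched as follows. The sketch assumes a known Lipschitz constant, the absolute-value metric on a one-dimensional input, and noise-free data; it does not implement the metric-parameter optimisation the paper proposes, and the function names are hypothetical.

```python
def lipschitz_bounds(x, xs, ys, lipschitz_constant):
    """Tightest lower/upper bounds on f(x) consistent with the data and the constant."""
    upper = min(y + lipschitz_constant * abs(x - xi) for xi, y in zip(xs, ys))
    lower = max(y - lipschitz_constant * abs(x - xi) for xi, y in zip(xs, ys))
    return lower, upper


def lipschitz_predict(x, xs, ys, lipschitz_constant):
    """Predict the midpoint of the feasible interval at x."""
    lower, upper = lipschitz_bounds(x, xs, ys, lipschitz_constant)
    return 0.5 * (lower + upper)


# Hypothetical noise-free samples of f(x) = |x|, which is 1-Lipschitz.
xs = [-1.0, 0.0, 1.0]
ys = [1.0, 0.0, 1.0]
prediction = lipschitz_predict(0.5, xs, ys, lipschitz_constant=1.0)
lower, upper = lipschitz_bounds(2.0, xs, ys, lipschitz_constant=1.0)
```

Every sample constrains the unknown function to a cone around the observed value; the prediction is the midpoint between the lowest upper cone and the highest lower cone, and the gap between them widens away from the data.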

#### The unreasonable effectiveness of structured random orthogonal embeddings

Krzysztof Choromanski, Mark Rowland, Adrian Weller, December 2017. (In Advances in Neural Information Processing Systems 30). Long Beach, California.

We examine a class of embeddings based on structured random matrices with orthogonal rows which can be applied in many machine learning applications including dimensionality reduction and kernel approximation. For both the Johnson-Lindenstrauss transform and the angular kernel, we show that we can select matrices yielding guaranteed improved performance in accuracy and/or speed compared to earlier methods. We introduce matrices with complex entries which give significant further accuracy improvement. We provide geometric and Markov chain-based perspectives to help understand the benefits, and empirical results which suggest that the approach is helpful in a wider range of applications.

#### Concrete dropout

Yarin Gal, Jiri Hron, Alex Kendall, 2017. (NeurIPS).

Dropout is used as a practical tool to obtain uncertainty estimates in large vision models and reinforcement learning (RL) tasks. But to obtain well-calibrated uncertainty estimates, a grid-search over the dropout probabilities is necessary—a prohibitive operation with large models, and an impossible one with RL. We propose a new dropout variant which gives improved performance and better calibrated uncertainties. Relying on recent developments in Bayesian deep learning, we use a continuous relaxation of dropout’s discrete masks. Together with a principled optimisation objective, this allows for automatic tuning of the dropout probability in large models, and as a result faster experimentation cycles. In RL this allows the agent to adapt its uncertainty dynamically as more data is observed. We analyse the proposed variant extensively on a range of tasks, and give insights into common practice in the field where larger dropout probabilities are often used in deeper model layers.

#### Deep Reinforcement Learning for Robotic Manipulation with Asynchronous Off-Policy Updates

Shixiang Gu, Ethan Holly, Timothy Lillicrap, Sergey Levine, May 2017. (In IEEE International Conference on Robotics and Automation). Singapore.

Reinforcement learning holds the promise of enabling autonomous robots to learn large repertoires of behavioral skills with minimal human intervention. However, robotic applications of reinforcement learning often compromise the autonomy of the learning process in favor of achieving training times that are practical for real physical systems. This typically involves introducing hand-engineered policy representations and human-supplied demonstrations. Deep reinforcement learning alleviates this limitation by training general-purpose neural network policies, but applications of direct deep reinforcement learning algorithms have so far been restricted to simulated settings and relatively simple tasks, due to their apparent high sample complexity. In this paper, we demonstrate that a recent deep reinforcement learning algorithm based on off-policy training of deep Q-functions can scale to complex 3D manipulation tasks and can learn deep neural network policies efficiently enough to train on real physical robots. We demonstrate that the training times can be further reduced by parallelizing the algorithm across multiple robots which pool their policy updates asynchronously. Our experimental evaluation shows that our method can learn a variety of 3D manipulation skills in simulation and a complex door opening skill on real robots without any prior demonstrations or manually designed representations.

**Comment:** [Google Blogpost] [MIT Technology Review] [Video]

#### Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic

Shixiang Gu, Timothy Lillicrap, Zoubin Ghahramani, Richard E. Turner, Sergey Levine, April 2017. (In 5th International Conference on Learning Representations). Toulon, France.

Model-free deep reinforcement learning (RL) methods have been successful in a wide variety of simulated domains. However, a major obstacle facing deep RL in the real world is their high sample complexity. Batch policy gradient methods offer stable learning, but at the cost of high variance, which often requires large batches. TD-style methods, such as off-policy actor-critic and Q-learning, are more sample-efficient but biased, and often require costly hyperparameter sweeps to stabilize. In this work, we aim to develop methods that combine the stability of policy gradients with the efficiency of off-policy RL. We present Q-Prop, a policy gradient method that uses a Taylor expansion of the off-policy critic as a control variate. Q-Prop is both sample efficient and stable, and effectively combines the benefits of on-policy and off-policy methods. We analyze the connection between Q-Prop and existing model-free algorithms, and use control variate theory to derive two variants of Q-Prop with conservative and aggressive adaptation. We show that conservative Q-Prop provides substantial gains in sample efficiency over trust region policy optimization (TRPO) with generalized advantage estimation (GAE), and improves stability over deep deterministic policy gradient (DDPG), the state-of-the-art on-policy and off-policy methods, on OpenAI Gym’s MuJoCo continuous control environments.

#### Interpolated Policy Gradient: Merging On-Policy and Off-Policy Policy Gradient Estimation for Deep Reinforcement Learning

Shixiang Gu, Timothy Lillicrap, Zoubin Ghahramani, Richard E. Turner, Bernhard Schölkopf, Sergey Levine, Dec 2017. (In Advances in Neural Information Processing Systems 30). Long Beach, USA.

Off-policy model-free deep reinforcement learning methods using previously collected data can improve sample efficiency over on-policy policy gradient techniques. On the other hand, on-policy algorithms are often more stable and easier to use. This paper examines, both theoretically and empirically, approaches to merging on- and off-policy updates for deep reinforcement learning. Theoretical results show that off-policy updates with a value function estimator can be interpolated with on-policy policy gradient updates whilst still satisfying performance bounds. Our analysis uses control variate methods to produce a family of policy gradient algorithms, with several recently proposed algorithms being special cases of this family. We then provide an empirical comparison of these techniques with the remaining algorithmic details fixed, and show how different mixing of off-policy gradient estimates with on-policy samples contribute to improvements in empirical performance. The final algorithm provides a generalization and unification of existing deep policy gradient techniques, has theoretical guarantees on the bias introduced by off-policy updates, and improves on the state-of-the-art model-free deep RL methods on a number of OpenAI Gym continuous control benchmarks.

#### Closed-form Inference and Prediction in Gaussian Process State-Space Models

Alessandro Davide Ialongo, Mark van der Wilk, Carl Edward Rasmussen, December 2017. (In NIPS Time Series Workshop 2017). Long Beach.

We examine an analytic variational inference scheme for the Gaussian Process State Space Model (GPSSM) - a probabilistic model for system identification and time-series modelling. Our approach performs variational inference over both the system states and the transition function. We exploit Markov structure in the true posterior, as well as an inducing point approximation to achieve linear time complexity in the length of the time series. Contrary to previous approaches, no Monte Carlo sampling is required: inference is cast as a deterministic optimisation problem. In a number of experiments, we demonstrate the ability to model non-linear dynamics in the presence of both process and observation noise as well as to impute missing information (e.g. velocities from raw positions through time), to de-noise, and to estimate the underlying dimensionality of the system. Finally, we also introduce a closed-form method for multi-step prediction, and a novel criterion for assessing the quality of our approximate posterior.

#### Categorical Reparameterization with Gumbel-Softmax

Eric Jang, Shixiang Gu, Ben Poole, April 2017. (In 5th International Conference on Learning Representations). Toulon, France.

Categorical variables are a natural choice for representing discrete structure in the world. However, stochastic neural networks rarely use categorical latent variables due to the inability to backpropagate through samples. In this work, we present an efficient gradient estimator that replaces the non-differentiable sample from a categorical distribution with a differentiable sample from a novel Gumbel-Softmax distribution. This distribution has the essential property that it can be smoothly annealed into a categorical distribution. We show that our Gumbel-Softmax estimator outperforms state-of-the-art gradient estimators on structured output prediction and unsupervised generative modeling tasks with categorical latent variables, and enables large speedups on semi-supervised classification.
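A minimal sketch of drawing one Gumbel-Softmax sample is given below, assuming the standard form softmax((log π + g) / τ) with independent standard Gumbel noise g. It uses only the Python standard library, so it shows the forward sampling step only; computing gradients through the sample would need an automatic differentiation library. The class probabilities and temperatures are hypothetical.

```python
import math
import random


def sample_gumbel(rng):
    """Draw a standard Gumbel variate by inverse-CDF sampling."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))


def gumbel_softmax_sample(probs, temperature, rng):
    """Relaxed one-hot sample: softmax((log pi + g) / temperature)."""
    logits = [(math.log(p) + sample_gumbel(rng)) / temperature for p in probs]
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
probs = [0.1, 0.3, 0.6]  # hypothetical class probabilities
soft = gumbel_softmax_sample(probs, temperature=1.0, rng=rng)
sharp = gumbel_softmax_sample(probs, temperature=0.01, rng=rng)
```

Each sample lies on the probability simplex. As the temperature is lowered, samples concentrate near one-hot vectors, and the index of the largest entry is distributed according to `probs` at any temperature.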

#### Sequence Tutor: Conservative fine-tuning of sequence generation models with KL-control

Natasha Jaques, Shixiang Gu, Dzmitry Bahdanau, José Miguel Hernández-Lobato, Richard E. Turner, Douglas Eck, Aug 2017. (In 34th International Conference on Machine Learning). Sydney, Australia.

This paper proposes a general method for improving the structure and quality of sequences generated by a recurrent neural network (RNN), while maintaining information originally learned from data, as well as sample diversity. An RNN is first pre-trained on data using maximum likelihood estimation (MLE), and the probability distribution over the next token in the sequence learned by this model is treated as a prior policy. Another RNN is then trained using reinforcement learning (RL) to generate higher-quality outputs that account for domain-specific incentives while retaining proximity to the prior policy of the MLE RNN. To formalize this objective, we derive novel off-policy RL methods for RNNs from KL-control. The effectiveness of the approach is demonstrated on two applications: 1) generating novel musical melodies, and 2) computational molecular generation. For both problems, we show that the proposed method improves the desired properties and structure of the generated sequences, while maintaining information learned from data.

**Comment:** [MIT Technology Review] [Video]

#### Avoiding Discrimination through Causal Reasoning

Niki Kilbertus, Mateo Rojas-Carulla, Giambattista Parascandolo, Moritz Hardt, Dominik Janzing, Bernhard Schölkopf, December 2017. (In Advances in Neural Information Processing Systems 30). Long Beach, California.

Recent work on fairness in machine learning has focused on various statistical discrimination criteria and how they trade off. Most of these criteria are observational: They depend only on the joint distribution of predictor, protected attribute, features, and outcome. While convenient to work with, observational criteria have severe inherent limitations that prevent them from resolving matters of fairness conclusively. Going beyond observational criteria, we frame the problem of discrimination based on protected attributes in the language of causal reasoning. This viewpoint shifts attention from “What is the right fairness criterion?” to “What do we want to assume about our model of the causal data generating process?” Through the lens of causality, we make several contributions. First, we crisply articulate why and when observational criteria fail, thus formalizing what was before a matter of opinion. Second, our approach exposes previously ignored subtleties and why they are fundamental to the problem. Finally, we put forward natural causal non-discrimination criteria and develop algorithms that satisfy them.

#### Using Inertial Sensors for Position and Orientation Estimation

Manon Kok, Jeroen D. Hol, Thomas B. Schön, 2017. (Foundations and Trends in Signal Processing).

In recent years, MEMS inertial sensors (3D accelerometers and 3D gyroscopes) have become widely available due to their small size and low cost. Inertial sensor measurements are obtained at high sampling rates and can be integrated to obtain position and orientation information. These estimates are accurate on a short time scale, but suffer from integration drift over longer time scales. To overcome this issue, inertial sensors are typically combined with additional sensors and models. In this tutorial we focus on the signal processing aspects of position and orientation estimation using inertial sensors. We discuss different modeling choices and a selected number of important algorithms. The algorithms include optimization-based smoothing and filtering as well as computationally cheaper extended Kalman filter and complementary filter implementations. The quality of their estimates is illustrated using both experimental and simulated data.

**Comment:** arXiv

#### Dropout Inference in Bayesian Neural Networks with Alpha-divergences

Yingzhen Li, Yarin Gal, Aug 2017. (In 34th International Conference on Machine Learning). Sydney, Australia.

To obtain uncertainty estimates with real-world Bayesian deep learning models, practical inference approximations are needed. Dropout variational inference (VI) for example has been used for machine vision and medical applications, but VI can severely underestimate model uncertainty. Alpha-divergences are alternative divergences to VI’s KL objective, which are able to avoid VI’s uncertainty underestimation. But these are hard to use in practice: existing techniques can only use Gaussian approximating distributions, and require existing models to be changed radically, thus are of limited use for practitioners. We propose a re-parametrisation of the alpha-divergence objectives, deriving a simple inference technique which, together with dropout, can be easily implemented with existing models by simply changing the loss of the model. We demonstrate improved uncertainty estimates and accuracy compared to VI in dropout networks. We study our model’s epistemic uncertainty far away from the data using adversarial images, showing that these can be distinguished from non-adversarial images by examining our model’s uncertainty.

#### Learning-based Nonlinear Model Predictive Control

Daniel Limon, Jan-Peter Calliess, Jan Maciejowski, July 2017. (In IFAC 2017 World Congress). Toulouse, France. **DOI**: 10.1016/j.ifacol.2017.08.1050.

This paper presents stabilizing Model Predictive Controllers (MPC) in which prediction models are inferred from experimental data of the inputs and outputs of the plant. Using a nonparametric machine learning technique called LACKI, the estimated (possibly nonlinear) model function together with an estimate of the Hölder constant is provided. Based on these, a number of predictive controllers with stability guaranteed by design are proposed. Firstly, the case when the prediction model is estimated off-line is considered and robust stability and recursive feasibility are ensured by using tightened constraints in the optimisation problem. This controller has been extended to the more interesting and complex case: the online learning of the model, where the new data collected from feedback is added to enhance the prediction model. An on-line learning MPC based on a double sequence of predictions is proposed. Stability of the online learning MPC is proved. These controllers are illustrated by simulation.

#### General Bayesian inference schemes in infinite mixture models

Maria Lomeli, 2017. University College London, Gatsby Unit, London, UK.

Bayesian statistical models allow us to formalise our knowledge about the world and reason about our uncertainty, but there is a need for better procedures to accurately encode its complexity. One way to do so is through compositional models, which are formed by combining blocks consisting of simpler models. One can increase the complexity of the compositional model by either stacking more blocks or by using a not-so-simple model as a building block. This thesis is an example of the latter. One first aim is to expand the choice of Bayesian nonparametric (BNP) blocks for constructing tractable compositional models. So far, most of the models that have a Bayesian nonparametric component use a Dirichlet Process or a Pitman-Yor process because of the availability of tractable and compact representations. This thesis shows how to overcome certain intractabilities in order to obtain analogous compact representations for the class of Poisson-Kingman priors which includes the Dirichlet and Pitman-Yor processes. A major impediment to the widespread use of Bayesian nonparametric building blocks is that inference is often costly, intractable or difficult to carry out. This is an active research area since dealing with the model’s infinite dimensional component forbids the direct use of standard simulation-based methods. The main contribution of this thesis is a variety of inference schemes that tackle this problem: Markov chain Monte Carlo and Sequential Monte Carlo methods, which are exact inference schemes since they target the true posterior. The contributions of this thesis, in a larger context, provide general purpose exact inference schemes in the flavour of probabilistic programming: the user is able to choose from a variety of models, focusing only on the modelling part. Indeed, if the wide enough class of Poisson-Kingman priors is used as one of our blocks, this objective is achieved.

#### A marginal sampler for sigma-Stable Poisson-Kingman mixture models

Maria Lomeli, Stefano Favaro, Yee Whye Teh, 2017. (Journal of Computational and Graphical Statistics).

We investigate the class of sigma-stable Poisson-Kingman random probability measures (RPMs) in the context of Bayesian nonparametric mixture modeling. This is a large class of discrete RPMs, which encompasses most of the popular discrete RPMs used in Bayesian nonparametrics, such as the Dirichlet process, Pitman-Yor process, the normalized inverse Gaussian process, and the normalized generalized Gamma process. We show how certain sampling properties and marginal characterizations of sigma-stable Poisson-Kingman RPMs can be usefully exploited for devising a Markov chain Monte Carlo (MCMC) algorithm for performing posterior inference with a Bayesian nonparametric mixture model. Specifically, we introduce a novel and efficient MCMC sampling scheme in an augmented space that has a small number of auxiliary variables per iteration. We apply our sampling scheme to density estimation and clustering tasks with unidimensional and multidimensional datasets, and compare it against competing MCMC sampling schemes. Supplementary materials for this article are available online.

#### Sample-then-optimise posterior sampling for Bayesian linear models

Alexander G. D. G. Matthews, Jiri Hron, Richard E. Turner, Zoubin Ghahramani, 2017. (AABI (NeurIPS workshop)).

In modern machine learning it is common to train models which have an extremely high intrinsic capacity. The results obtained are often initialization dependent, are different for disparate optimizers and in some cases have no explicit regularization. This raises difficult questions about generalization. A natural approach to questions of generalization is a Bayesian one. There is therefore a growing literature attempting to understand how Bayesian posterior inference could emerge from the complexity of modern practice, even without having such a procedure as the stated goal. In this work we consider a simple special case where exact Bayesian posterior sampling emerges from sampling (cf initialization) and then gradient descent. Specifically, for a Bayesian linear model, if we parameterize it in terms of a deterministic function of an isotropic normal prior, then the action of sampling from the prior followed by first order optimization of the squared loss will give a posterior sample. Although the assumptions are stronger than many real problems, it still exhibits the challenging properties of redundant model capacity and a lack of explicit regularizers, along with initialization and optimizer dependence. It is therefore an interesting controlled test case. Given its simplicity, the method itself may turn out to be of independent interest from our original goal.
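The procedure described in this abstract can be illustrated with a small sketch. It assumes the simplest setting: noise-free observations, a standard normal prior on the weights, and more parameters than data points, so that gradient descent only moves the weights within the row space of the design matrix. The design matrix, targets, step size, and function names are hypothetical.

```python
import random


def matvec(rows, v):
    """Multiply a matrix, given as a list of rows, by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in rows]


def sample_then_optimise(X, y, rng, lr=0.1, steps=500):
    """Sample theta from N(0, I), then run gradient descent on 0.5 * ||X theta - y||^2."""
    d = len(X[0])
    theta0 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    theta = list(theta0)
    for _ in range(steps):
        residual = [p - t for p, t in zip(matvec(X, theta), y)]
        # The gradient X^T residual always lies in the row space of X.
        grad = [sum(X[i][j] * residual[i] for i in range(len(X))) for j in range(d)]
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta0, theta


# Hypothetical design: 2 noise-free observations, 3 parameters.
X = [[1.0, 2.0, 0.0], [0.0, 1.0, 1.0]]
y = [1.0, -1.0]
theta0, theta = sample_then_optimise(X, y, random.Random(0))
```

Under these assumptions the optimised weights fit the data exactly, while the component of the prior sample in the null space of `X` is left untouched; repeating the procedure with fresh prior samples yields draws from the Gaussian posterior conditioned on the observations.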

#### Concrete Problems for Autonomous Vehicle Safety: Advantages of Bayesian Deep Learning

Rowan McAllister, Yarin Gal, Alex Kendall, Mark van der Wilk, Amar Shah, Roberto Cipolla, Adrian Weller, August 2017. (In International Joint Conference on Artificial Intelligence). Melbourne, Australia.

Autonomous vehicle (AV) software is typically composed of a pipeline of individual components, linking sensor inputs to motor outputs. Erroneous component outputs propagate downstream, hence safe AV software must consider the ultimate effect of each component’s errors. Further, improving safety alone is not sufficient. Passengers must also feel safe to trust and use AV systems. To address such concerns, we investigate three under-explored themes for AV research: safety, interpretability, and compliance. Safety can be improved by quantifying the uncertainties of component outputs and propagating them forward through the pipeline. Interpretability is concerned with explaining what the AV observes and why it makes the decisions it does, building reassurance with the passenger. Compliance refers to maintaining some control for the passenger. We discuss open challenges for research within these themes. We highlight the need for concrete evaluation metrics, propose example problems, and highlight possible solutions.

#### Data-Efficient Reinforcement Learning in Continuous State-Action Gaussian-POMDPs

Rowan McAllister, Carl Edward Rasmussen, December 2017. (In Advances in Neural Information Processing Systems 30). Long Beach, California.

We present a data-efficient reinforcement learning method for continuous state-action systems under significant observation noise. Data-efficient solutions under small noise exist, such as PILCO which learns the cartpole swing-up task in 30s. PILCO evaluates policies by planning state-trajectories using a dynamics model. However, PILCO applies policies to the observed state, therefore planning in observation space. We extend PILCO with filtering to instead plan in belief space, consistent with partially observable Markov decision process (POMDP) planning. This enables data-efficient learning under significant observation noise, outperforming more naive methods such as post-hoc application of a filter to policies optimised by the original (unfiltered) PILCO algorithm. We test our method on the cartpole swing-up task, which involves nonlinear dynamics and requires nonlinear control.

#### Conditions beyond treewidth for tightness of higher-order LP relaxations

Mark Rowland, Aldo Pacchiano, Adrian Weller, April 2017. (In 20th International Conference on Artificial Intelligence and Statistics). Fort Lauderdale, Florida.

Linear programming (LP) relaxations are a popular method to attempt to find a most likely configuration of a discrete graphical model. If a solution to the relaxed problem is obtained at an integral vertex then the solution is guaranteed to be exact and we say that the relaxation is tight. We consider binary pairwise models and introduce new methods which allow us to demonstrate refined conditions for tightness of LP relaxations in the Sherali-Adams hierarchy. Our results include showing that for higher order LP relaxations, treewidth is not precisely the right way to characterize tightness. This work is primarily theoretical, with insights that can improve efficiency in practice.

#### Uprooting and rerooting higher-order graphical models

Mark Rowland, Adrian Weller, December 2017. (In Advances in Neural Information Processing Systems 30). Long Beach, California.

The idea of uprooting and rerooting graphical models was introduced specifically for binary pairwise models by Weller [18] as a way to transform a model to any of a whole equivalence class of related models, such that inference on any one model yields inference results for all others. This is very helpful since inference, or relevant bounds, may be much easier to obtain or more accurate for some model in the class. Here we introduce methods to extend the approach to models with higher-order potentials and develop theoretical insights. For example, we demonstrate that the triplet-consistent polytope TRI is unique in being ‘universally rooted’. We demonstrate empirically that rerooting can significantly improve accuracy of methods of inference for higher-order models at negligible computational cost.

#### On orientation estimation using iterative methods in Euclidean space

Martin A. Skoglund, Zoran Sjanic, Manon Kok, July 2017. (In Proceedings of the 20th International Conference on Information Fusion). Xi'an, China. **DOI**: 10.23919/ICIF.2017.8009830.

This paper presents three iterative methods for orientation estimation. The first two are based on iterated Extended Kalman filter (IEKF) formulations with different state representations. The first is using the well-known unit quaternion as state (q-IEKF) while the other is using orientation deviation which we call IMEKF. The third method is based on nonlinear least squares (NLS) estimation of the angular velocity which is used to parametrise the orientation. The results are obtained using Monte Carlo simulations and the comparison is done with the non-iterative EKF and multiplicative EKF (MEKF) as baseline. The result clearly shows that the IMEKF and the NLS-based method are superior to q-IEKF and all three outperform the non-iterative methods.

#### Safe semi-supervised learning of sum-product networks

Martin Trapp, Tamas Madl, Robert Peharz, Franz Pernkopf, Robert Trappl, August 2017. (In 33rd Conference on Uncertainty in Artificial Intelligence). Sydney, Australia.

In several domains obtaining class annotations is expensive while at the same time unlabelled data are abundant. While most semi-supervised approaches enforce restrictive assumptions on the data distribution, recent work has managed to learn semi-supervised models in a non-restrictive regime. However, so far such approaches have only been proposed for linear models. In this work, we introduce semi-supervised parameter learning for Sum-Product Networks (SPNs). SPNs are deep probabilistic models admitting inference in linear time in number of network edges. Our approach has several advantages, as it (1) allows generative and discriminative semi-supervised learning, (2) guarantees that adding unlabelled data can increase, but not degrade, the performance (safe), and (3) is computationally efficient and does not enforce restrictive assumptions on the data distribution. We show on a variety of data sets that safe semi-supervised learning with SPNs is competitive compared to state-of-the-art and can lead to a better generative and discriminative objective value than a purely supervised approach.

#### Magnetic Hamiltonian Monte Carlo

Nilesh Tripuraneni, Mark Rowland, Zoubin Ghahramani, Richard E. Turner, 2017. (In 34th International Conference on Machine Learning).

Hamiltonian Monte Carlo (HMC) exploits Hamiltonian dynamics to construct efficient proposals for Markov chain Monte Carlo (MCMC). In this paper, we present a generalization of HMC which exploits non-canonical Hamiltonian dynamics. We refer to this algorithm as magnetic HMC, since in 3 dimensions a subset of the dynamics map onto the mechanics of a charged particle coupled to a magnetic field. We establish a theoretical basis for the use of non-canonical Hamiltonian dynamics in MCMC, and construct a symplectic, leapfrog-like integrator allowing for the implementation of magnetic HMC. Finally, we exhibit several examples where these non-canonical dynamics can lead to improved mixing of magnetic HMC relative to ordinary HMC.

#### Convolutional Gaussian Processes

Mark van der Wilk, Carl Edward Rasmussen, James Hensman, 2017. (In Advances in Neural Information Processing Systems 30).

We present a practical way of introducing convolutional structure into Gaussian processes, making them more suited to high-dimensional inputs like images. The main contribution of our work is the construction of an inter-domain inducing point approximation that is well-tailored to the convolutional kernel. This allows us to gain the generalisation benefit of a convolutional kernel, together with fast but accurate posterior inference. We investigate several variations of the convolutional kernel, and apply it to MNIST and CIFAR-10, which have both been known to be challenging for Gaussian processes. We also show how the marginal likelihood can be used to find an optimal weighting between convolutional and RBF kernels to further improve performance. We hope that this illustration of the usefulness of a marginal likelihood will help automate discovering architectures in larger models.

**Comment:** arXiv

#### From parity to preference: Learning with cost-effective notions of fairness

M. B. Zafar, Isabel Valera, Manuel Rodriguez, Krishna P. Gummadi, Adrian Weller, December 2017. (In Advances in Neural Information Processing Systems 30). Long Beach, California.

The adoption of automated, data-driven decision making in an ever expanding range of applications has raised concerns about its potential unfairness towards certain social groups. In this context, a number of recent studies have focused on defining, detecting, and removing unfairness from data-driven decision systems. However, the existing notions of fairness, based on parity (equality) in treatment or outcomes for different social groups, tend to be needlessly stringent, limiting the overall decision making accuracy. In this paper, we draw inspiration from the fair-division and envy-freeness literature in economics and game theory and propose preference-based notions of fairness: given the choice between various sets of decision treatments or outcomes, any group of users would collectively prefer its treatment or outcomes, regardless of the (dis)parity as compared to the other groups. Then, we introduce tractable proxies to design convex margin-based classifiers that satisfy these preference-based notions of fairness. Finally, we experiment with a variety of synthetic and real-world datasets and show that preference-based fairness allows for greater decision accuracy than parity-based fairness.

## 2016

#### The Mondrian Kernel

Matej Balog, Balaji Lakshminarayanan, Zoubin Ghahramani, Daniel M. Roy, Yee Whye Teh, June 2016. (In 32nd Conference on Uncertainty in Artificial Intelligence). Jersey City, New Jersey, USA.

We introduce the Mondrian kernel, a fast random feature approximation to the Laplace kernel. It is suitable for both batch and online learning, and admits a fast kernel-width-selection procedure as the random features can be re-used efficiently for all kernel widths. The features are constructed by sampling trees via a Mondrian process [Roy and Teh, 2009], and we highlight the connection to Mondrian forests [Lakshminarayanan et al., 2014], where trees are also sampled via a Mondrian process, but fit independently. This link provides a new insight into the relationship between kernel methods and random forests.

**Comment:** [Supplementary Material] [arXiv] [Poster] [Slides] [Code]
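
The link between Mondrian partitions and the Laplace kernel described in this abstract can be checked numerically with a small sketch. It assumes inputs in the unit box and estimates the kernel as the fraction of independently sampled Mondrian partitions in which two points share a cell. For brevity it only follows the branch of each partition that contains `x`, which is a simplification for illustration and not the paper's feature construction. The function names and the `lifetime` parameter name are hypothetical.

```python
import math
import random


def same_cell(x, y, lifetime, rng):
    """Sample one Mondrian partition of the unit box along the branch
    containing x, and report whether y ends up in the same cell."""
    lo = [0.0] * len(x)
    hi = [1.0] * len(x)
    budget = lifetime
    while True:
        sides = [h - l for l, h in zip(lo, hi)]
        # Time until the next cut is exponential with rate = sum of side lengths.
        budget -= rng.expovariate(sum(sides))
        if budget < 0:
            return True  # lifetime exhausted before x and y were separated
        # Cut dimension chosen proportionally to side length, position uniform.
        d = rng.choices(range(len(x)), weights=sides)[0]
        cut = rng.uniform(lo[d], hi[d])
        if (x[d] < cut) != (y[d] < cut):
            return False  # the cut separates x from y
        if x[d] < cut:
            hi[d] = cut
        else:
            lo[d] = cut


def mondrian_kernel(x, y, lifetime, num_trees, rng):
    """Fraction of sampled partitions in which x and y share a cell."""
    hits = sum(same_cell(x, y, lifetime, rng) for _ in range(num_trees))
    return hits / num_trees


def laplace_kernel(x, y, lifetime):
    """Laplace kernel exp(-lifetime * ||x - y||_1), the limiting value."""
    l1 = sum(abs(a - b) for a, b in zip(x, y))
    return math.exp(-lifetime * l1)
```

With a large `num_trees`, `mondrian_kernel` should be close to `laplace_kernel` for the same `lifetime`, which here plays the role of the inverse kernel width.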

#### Understanding Probabilistic Sparse Gaussian Process Approximations

Matthias Stephan Bauer, Mark van der Wilk, Carl Edward Rasmussen, 2016. (In Advances in Neural Information Processing Systems 29).

Good sparse approximations are essential for practical inference in Gaussian Processes as the computational cost of exact methods is prohibitive for large datasets. The Fully Independent Training Conditional (FITC) and the Variational Free Energy (VFE) approximations are two recent popular methods. Despite superficial similarities, these approximations have surprisingly different theoretical properties and behave differently in practice. We thoroughly investigate the two methods for regression both analytically and through illustrative examples, and draw conclusions to guide practical application.

**Comment:** arXiv

#### Fabular: Regression Formulas As Probabilistic Programming

Johannes Borgström, Andrew D. Gordon, Long Ouyang, Claudio Russo, Adam Ścibior, Marcin Szymczak, 2016. (In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages). New York, NY, USA. St. Petersburg, FL, USA. ACM. POPL 2016. **DOI**: 10.1145/2837614.2837653. **ISBN**: 978-1-4503-3549-2. **ACM ID**: 2837653.

Regression formulas are a domain-specific language adopted by several R packages for describing an important and useful class of statistical models: hierarchical linear regressions. Formulas are succinct, expressive, and clearly popular, so are they a useful addition to probabilistic programming languages? And what do they mean? We propose a core calculus of hierarchical linear regression, in which regression coefficients are themselves defined by nested regressions (unlike in R). We explain how our calculus captures the essence of the formula DSL found in R. We describe the design and implementation of Fabular, a version of the Tabular schema-driven probabilistic programming language, enriched with formulas based on our regression calculus. To the best of our knowledge, this is the first formal description of the core ideas of R’s formula notation, the first development of a calculus of regression formulas, and the first demonstration of the benefits of composing regression formulas and latent variables in a probabilistic programming language.

#### Deep Gaussian Processes for Regression using Approximate Expectation Propagation

Thang D. Bui, Daniel Hernández-Lobato, José Miguel Hernández-Lobato, Yingzhen Li, Richard E. Turner, June 2016. (In 33rd International Conference on Machine Learning). New York, USA.

Deep Gaussian processes (DGPs) are multi-layer hierarchical generalisations of Gaussian processes (GPs) and are formally equivalent to neural networks with multiple, infinitely wide hidden layers. DGPs are nonparametric probabilistic models and as such are arguably more flexible, have a greater capacity to generalise, and provide better calibrated uncertainty estimates than alternative deep models. This paper develops a new approximate Bayesian learning scheme that enables DGPs to be applied to a range of medium to large scale regression problems for the first time. The new method uses an approximate Expectation Propagation procedure and a novel and efficient extension of the probabilistic backpropagation algorithm for learning. We evaluate the new method for non-linear regression on eleven real-world datasets, showing that it always outperforms GP regression and is almost always better than state-of-the-art deterministic and sampling-based approximate inference methods for Bayesian neural networks. As a by-product, this work provides a comprehensive analysis of six approximate Bayesian methods for training neural networks.

#### Lazily Adapted Constant Kinky Inference for Nonparametric Regression and Model-Reference Adaptive Control

Jan-Peter Calliess, 2016. (arXiv).

Techniques known as Nonlinear Set Membership prediction, Lipschitz Interpolation or Kinky Inference are approaches to machine learning that utilise presupposed Lipschitz properties to compute inferences over unobserved function values. Provided a bound on the true best Lipschitz constant of the target function is known a priori they offer convergence guarantees as well as bounds around the predictions. Considering a more general setting that builds on Hölder continuity relative to pseudo-metrics, we propose a method for estimating the Hölder constant online from function value observations that possibly are corrupted by bounded observational errors. Utilising this to compute adaptive parameters within a kinky inference rule gives rise to a nonparametric machine learning method, for which we establish strong universal approximation guarantees. That is, we show that our prediction rule can learn any continuous function in the limit of increasingly dense data to within a worst-case error bound that depends on the level of observational uncertainty. We apply our method in the context of nonparametric model-reference adaptive control (MRAC). Across a range of simulated aircraft roll-dynamics and performance metrics our approach outperforms recently proposed alternatives that were based on Gaussian processes and RBF-neural networks. For discrete-time systems, we provide stability guarantees for our learning-based controllers both for the batch and the online learning setting.
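
The basic Lipschitz interpolation rule that this abstract builds on can be sketched as follows. The sketch assumes scalar inputs, noise-free observations, the absolute difference as the metric, and a Lipschitz constant that is known in advance. It does not implement the lazily adapted constant estimation proposed in the paper, and the function name is hypothetical.

```python
def kinky_predict(x, xs, fs, lipschitz):
    """Return (prediction, lower, upper) at x from samples (xs, fs).

    Each sample bounds the target from above and below by a cone of slope
    `lipschitz`; the prediction is the midpoint of the tightest bounds.
    """
    upper = min(f + lipschitz * abs(x - xi) for xi, f in zip(xs, fs))
    lower = max(f - lipschitz * abs(x - xi) for xi, f in zip(xs, fs))
    return 0.5 * (upper + lower), lower, upper
```

For samples of f(x) = |x| taken at -1, 0 and 1 with `lipschitz=1.0`, the bounds coincide between the samples and widen outside them, which is the sense in which the method provides bounds around its predictions.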

#### A Distributed Mechanism for Multi-Agent Convex Optimisation and Coordination with No-Regret Learners

Jan-Peter Calliess, Nathan Korda, Geoffrey J. Gordon, December 2016. (In Workshop on Learning, Inference and Control of Multi-Agent Systems, NIPS). Barcelona, Spain.

We develop an indirect mechanism for coordinated, distributed multi-agent optimisation, and decision-making. Our approach extends previous work in no-regret learning based mechanism design and renders it applicable to partial information settings. We consider planning problems that can be stated as a collection of single-agent convex programmes coupled by common soft constraints. A key idea is to recast the joint optimisation problem as distributed learning in a repeated game between the original agents and a newly introduced group of adversarial agents who influence prices for decisions and facilitate coordination. Under the weak behavioural assumption that all agents employ selfish, sub-linear regret algorithms in the course of the repeated game, we guarantee that our mechanism can achieve design goals such as social optimality (efficiency) and Nash-equilibrium convergence to within an error which approaches zero as the agents gain experience. Our error bounds are deterministic or probabilistic, depending on the nature of the regret bounds available for the algorithms employed by the agents. We illustrate our method in an emissions market application.

#### Manifold Gaussian Processes for Regression

Roberto Calandra, Jan Peters, Carl Edward Rasmussen, Marc Peter Deisenroth, 2016. (In International Joint Conference on Neural Networks).

Off-the-shelf Gaussian Process (GP) covariance functions encode smoothness assumptions on the structure of the function to be modeled. To model complex and nondifferentiable functions, these smoothness assumptions are often too restrictive. One way to alleviate this limitation is to find a different representation of the data by introducing a feature space. This feature space is often learned in an unsupervised way, which might lead to data representations that are not useful for the overall regression task. In this paper, we propose Manifold Gaussian Processes, a novel supervised method that jointly learns a transformation of the data into a feature space and a GP regression from the feature space to observed space. The Manifold GP is a full GP and allows learning data representations that are useful for the overall regression task. As a proof-of-concept, we evaluate our approach on complex non-smooth functions where standard GPs perform poorly, such as step functions and robotics tasks with contacts.

#### Bayesian generalised ensemble Markov chain Monte Carlo

Jes Frellsen, Ole Winther, Zoubin Ghahramani, Jesper Ferkinghoff-Borg, May 2016. (In 19th International Conference on Artificial Intelligence and Statistics). Cadiz, Spain.

Bayesian generalised ensemble (BayesGE) is a new method that addresses two major drawbacks of standard Markov chain Monte Carlo algorithms for inference in high-dimensional probability models: inapplicability to estimate the partition function, and poor mixing properties. BayesGE uses a Bayesian approach to iteratively update the belief about the density of states (distribution of the log likelihood under the prior) for the model, with the dual purpose of enhancing the sampling efficiency and making the estimation of the partition function tractable. We benchmark BayesGE on Ising and Potts systems and show that it compares favourably to existing state-of-the-art methods.

#### MuProp: Unbiased Backpropagation for Stochastic Neural Networks

Shixiang Gu, Sergey Levine, Ilya Sutskever, Andriy Mnih, May 2016. (In 4th International Conference on Learning Representations). San Juan, Puerto Rico.

Deep neural networks are powerful parametric models that can be trained efficiently using the backpropagation algorithm. Stochastic neural networks combine the power of large parametric functions with that of graphical models, which makes it possible to learn very complex distributions. However, as backpropagation is not directly applicable to stochastic networks that include discrete sampling operations within their computational graph, training such networks remains difficult. We present MuProp, an unbiased gradient estimator for stochastic networks, designed to make this task easier. MuProp improves on the likelihood-ratio estimator by reducing its variance using a control variate based on the first-order Taylor expansion of a mean-field network. Crucially, unlike prior attempts at using backpropagation for training stochastic networks, the resulting estimator is unbiased and well behaved. Our experiments on structured output prediction and discrete latent variable modeling demonstrate that MuProp yields consistently good performance across a range of difficult tasks.
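
The variance-reduction idea in this abstract can be illustrated on the smallest possible case. The sketch below assumes a single Bernoulli unit b with parameter theta and a differentiable function f, so the "mean-field network" reduces to evaluating f at the mean mu = theta. It compares the plain likelihood-ratio estimator of the gradient of E[f(b)] with a version that subtracts a first-order Taylor expansion of f around mu and adds back its analytic expectation. The function names are hypothetical, and the code is not the authors' implementation.

```python
import math


def lr_estimate(f, b, theta):
    """Likelihood-ratio (score-function) estimate of d/dtheta E[f(b)]."""
    score = 1.0 / theta if b == 1 else -1.0 / (1.0 - theta)
    return f(b) * score


def muprop_estimate(f, df, b, theta):
    """Score-function estimate with a first-order Taylor control variate
    around the mean-field value mu = theta, plus its analytic correction."""
    score = 1.0 / theta if b == 1 else -1.0 / (1.0 - theta)
    mu = theta
    baseline = f(mu) + df(mu) * (b - mu)
    # E[baseline * score] equals df(mu), so adding it back keeps the estimate unbiased.
    return (f(b) - baseline) * score + df(mu)


def moments(estimator, theta):
    """Exact mean and variance of an estimator under b ~ Bernoulli(theta)."""
    probs = {1: theta, 0: 1.0 - theta}
    mean = sum(p * estimator(b) for b, p in probs.items())
    var = sum(p * (estimator(b) - mean) ** 2 for b, p in probs.items())
    return mean, var
```

Because b takes only two values, `moments` enumerates them exactly. Both estimators have mean f(1) - f(0), and for a smooth choice such as `f = math.exp` the Taylor-corrected estimator has a much smaller variance.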

#### Continuous Deep Q-Learning with Model-based Acceleration

Shixiang Gu, Timothy Lillicrap, Ilya Sutskever, Sergey Levine, June 2016. (In 33rd International Conference on Machine Learning). New York, USA.

Model-free reinforcement learning has been successfully applied to a range of challenging problems, and has recently been extended to handle large neural network policies and value functions. However, the sample complexity of model-free algorithms, particularly when using high-dimensional function approximators, tends to limit their applicability to physical systems. In this paper, we explore algorithms and representations to reduce the sample complexity of deep reinforcement learning for continuous control tasks. We propose two complementary techniques for improving the efficiency of such algorithms. First, we derive a continuous variant of the Q-learning algorithm, which we call normalized advantage functions (NAF), as an alternative to the more commonly used policy gradient and actor-critic methods. NAF representation allows us to apply Q-learning with experience replay to continuous tasks, and substantially improves performance on a set of simulated robotic control tasks. To further improve the efficiency of our approach, we explore the use of learned models for accelerating model-free reinforcement learning. We show that iteratively refitted local linear models are especially effective for this, and demonstrate substantially faster learning on domains where such models are applicable.

#### Black-Box Alpha Divergence Minimization

José Miguel Hernández-Lobato, Yingzhen Li, Mark Rowland, Thang D. Bui, Daniel Hernández-Lobato, Richard E. Turner, June 2016. (In 33rd International Conference on Machine Learning). New York, USA.

Black-box alpha (BB-α) is a new approximate inference method based on the minimization of α-divergences. BB-α scales to large datasets because it can be implemented using stochastic gradient descent. BB-α can be applied to complex probabilistic models with little effort since it only requires as input the likelihood function and its gradients. These gradients can be easily obtained using automatic differentiation. By changing the divergence parameter α, the method is able to interpolate between variational Bayes (VB) (α → 0) and an algorithm similar to expectation propagation (EP) (α = 1). Experiments on probit regression and neural network regression and classification problems show that BB-α with non-standard settings of α, such as α = 0.5, usually produces better predictions than with α → 0 (VB) or α = 1 (EP).

#### Rényi Divergence Variational Inference

Yingzhen Li, Richard E. Turner, December 2016. (In Advances in Neural Information Processing Systems 29). Barcelona, Spain.

This paper introduces the variational Rényi bound (VR) that extends traditional variational inference to Rényi’s alpha-divergences. This new family of variational methods unifies a number of existing approaches, and enables a smooth interpolation from the evidence lower-bound to the log (marginal) likelihood that is controlled by the value of alpha that parametrises the divergence. The reparameterization trick, Monte Carlo approximation and stochastic optimisation methods are deployed to obtain a tractable and unified framework for optimisation. We further consider negative alpha values and propose a novel variational inference method as a new special case in the proposed framework. Experiments on Bayesian neural networks and variational auto-encoders demonstrate the wide applicability of the VR bound.
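
The interpolation between the evidence lower bound and the log marginal likelihood described in this abstract can be illustrated with a Monte Carlo sketch of the variational Rényi bound, written as log E_q[(p(x, z) / q(z))^(1 - alpha)] / (1 - alpha). The toy model is an assumption chosen because its evidence is known in closed form: z ~ N(0, 1), x | z ~ N(z, 1), with a Gaussian q(z). The function names are hypothetical.

```python
import math
import random


def log_normal_pdf(v, mean, var):
    """Log density of N(mean, var) evaluated at v."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (v - mean) ** 2 / var)


def vr_bound(x, alpha, q_mean, q_var, num_samples, rng):
    """Monte Carlo estimate of the variational Rényi bound for the toy model
    z ~ N(0, 1), x | z ~ N(z, 1), with q(z) = N(q_mean, q_var)."""
    log_w = []
    for _ in range(num_samples):
        z = rng.gauss(q_mean, math.sqrt(q_var))
        log_joint = log_normal_pdf(z, 0.0, 1.0) + log_normal_pdf(x, z, 1.0)
        log_w.append(log_joint - log_normal_pdf(z, q_mean, q_var))
    if abs(1.0 - alpha) < 1e-12:
        return sum(log_w) / num_samples  # alpha -> 1 limit: the usual ELBO estimate
    # Log-mean-exp of (1 - alpha) * log_w, computed stably.
    scaled = [(1.0 - alpha) * lw for lw in log_w]
    top = max(scaled)
    lme = top + math.log(sum(math.exp(s - top) for s in scaled) / num_samples)
    return lme / (1.0 - alpha)
```

In this toy model the exact posterior is N(x / 2, 1 / 2); plugging it in as q makes every importance weight equal to the evidence, so the estimate matches log p(x) for any alpha. With a mismatched q and a fixed set of samples, the estimate does not decrease as alpha moves from 1 towards 0.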

#### On Sparse Variational methods and the Kullback-Leibler divergence between stochastic processes

Alexander G D G Matthews, James Hensman, Richard E. Turner, Zoubin Ghahramani, May 2016. (In 19th International Conference on Artificial Intelligence and Statistics). Cadiz, Spain.

The variational framework for learning inducing variables (Titsias, 2009a) has had a large impact on the Gaussian process literature. The framework may be interpreted as minimizing a rigorously defined Kullback-Leibler divergence between the approximating and posterior processes. To our knowledge this connection has thus far gone unremarked in the literature. In this paper we give a substantial generalization of the literature on this topic. We give a new proof of the result for infinite index sets which allows inducing points that are not data points and likelihoods that depend on all function values. We then discuss augmented index sets and show that, contrary to previous works, marginal consistency of augmentation is not enough to guarantee consistency of variational inference with the original model. We then characterize an extra condition where such a guarantee is obtainable. Finally we show how our framework sheds light on interdomain sparse approximations and sparse approximations for Cox processes.

#### Bayesian Learning for Data-Efficient Control

Rowan McAllister, 2016. University of Cambridge, Department of Engineering, Cambridge, UK.

Applications to learn control of unfamiliar dynamical systems with increasing autonomy are ubiquitous. From robotics, to finance, to industrial processing, autonomous learning helps obviate a heavy reliance on experts for system identification and controller design. Often real world systems are nonlinear, stochastic, and expensive to operate (e.g. slow, energy intensive, prone to wear and tear). Ideally therefore, nonlinear systems can be identified with minimal system interaction. This thesis considers data efficient autonomous learning of control of nonlinear, stochastic systems. Data efficient learning critically requires probabilistic modelling of dynamics. Traditional control approaches use deterministic models, which easily overfit data, especially small datasets. We use probabilistic Bayesian modelling to learn systems from scratch, similar to the PILCO algorithm, which achieved unprecedented data efficiency in learning control of several benchmarks. We extend PILCO in three principal ways. First, we learn control under significant observation noise by simulating a filtered control process using a tractably analytic framework of Gaussian distributions. In addition, we develop the ‘latent variable belief Markov decision process’ when filters must predict under real-time constraints. Second, we improve PILCO’s data efficiency by directing exploration with predictive loss uncertainty and Bayesian optimisation, including a novel approximation to the Gittins index. Third, we take a step towards data efficient learning of high-dimensional control using Bayesian neural networks (BNN). Experimentally we show that although filtering mitigates adverse effects of observation noise, much greater performance is achieved when optimising controllers with evaluations faithful to reality: by simulating closed-loop filtered control if executing closed-loop filtered control. Thus, controllers are optimised w.r.t. how they are used, outperforming filters applied to systems optimised by unfiltered simulations. We show directed exploration improves data efficiency. Lastly, we show BNN dynamics models are almost as data efficient as Gaussian process models. Results show data efficient learning of high-dimensional control is possible as BNNs scale to high-dimensional state inputs.

#### Train and Test Tightness of LP Relaxations in Structured Prediction

Ofer Meshi, Mehrdad Mahdavi, Adrian Weller, David Sontag, June 2016. (In 33rd International Conference on Machine Learning). New York, NY.

Structured prediction is used in areas such as computer vision and natural language processing to predict structured outputs such as segmentations or parse trees. In these settings, prediction is performed by MAP inference or, equivalently, by solving an integer linear program. Because of the complex scoring functions required to obtain accurate predictions, both learning and inference typically require the use of approximate solvers. We propose a theoretical explanation to the striking observation that approximations based on linear programming (LP) relaxations are often tight on real-world instances. In particular, we show that learning with LP relaxed inference encourages integrality of training instances, and that tightness generalizes from train to test data.

#### Consistent Kernel Mean Estimation for Functions of Random Variables

Carl-Johann Simon-Gabriel, Adam Ścibior, Ilya Tolstikhin, Bernhard Schölkopf, 2016. (In Advances in Neural Information Processing Systems 29).

We provide a theoretical foundation for non-parametric estimation of functions of random variables using kernel mean embeddings. We show that for any continuous function f, consistent estimators of the mean embedding of a random variable X lead to consistent estimators of the mean embedding of f(X). For Matérn kernels and sufficiently smooth functions we also provide rates of convergence. Our results extend to functions of multiple random variables. If the variables are dependent, we require an estimator of the mean embedding of their joint distribution as a starting point; if they are independent, it is sufficient to have separate estimators of the mean embeddings of their marginal distributions. In either case, our results cover both mean embeddings based on i.i.d. samples as well as “reduced set” expansions in terms of dependent expansion points. The latter serves as a justification for using such expansions to limit memory resources when applying the approach as a basis for probabilistic programming.

#### Compressing combinatorial objects

Christian Steinruecken, January 2016.

Most of the world’s digital data is currently encoded in a sequential form, and compression methods for sequences have been studied extensively. However, there are many types of non-sequential data for which good compression techniques are still largely unexplored. This paper contributes insights and concrete techniques for compressing various kinds of non-sequential data via arithmetic coding, and derives re-usable probabilistic data models from fairly generic structural assumptions. Near-optimal compression methods are described for certain types of permutations, combinations and multisets; and the conditions for optimality are made explicit for each method.

#### Uprooting and Rerooting Graphical Models

Adrian Weller, June 2016. (In 33rd International Conference on Machine Learning). New York, NY.

We show how any binary pairwise model may be ‘uprooted’ to a fully symmetric model, wherein original singleton potentials are transformed to potentials on edges to an added variable, and then ‘rerooted’ to a new model on the original number of variables. The new model is essentially equivalent to the original model, with the same partition function and allowing recovery of the original marginals or a MAP configuration, yet may have very different computational properties that allow much more efficient inference. This meta-approach deepens our understanding, may be applied to any existing algorithm to yield improved methods in practice, generalizes earlier theoretical results, and reveals a remarkable interpretation of the triplet-consistent polytope.

#### Characterizing Tightness of LP Relaxations by Forbidding Signed Minors

Adrian Weller, June 2016. (In 32nd Conference on Uncertainty in Artificial Intelligence). Jersey City, NJ.

We consider binary pairwise graphical models and provide an exact characterization (necessary and sufficient conditions observing signs of potentials) of tightness for the LP relaxation on the triplet-consistent polytope of the MAP inference problem, by forbidding an odd-K5 (complete graph on 5 variables with all edges repulsive) as a signed minor in the signed suspension graph. This captures signs of both singleton and edge potentials in a compact and efficiently testable condition, and improves significantly on earlier results. We provide other results on tightness of LP relaxations by forbidding minors, draw connections and suggest paths for future research.

#### Clamping Improves TRW and Mean Field Approximations

Adrian Weller, Justin Domke, May 2016. (In 19th International Conference on Artificial Intelligence and Statistics). Cadiz, Spain.

We examine the effect of clamping variables for approximate inference in undirected graphical models with pairwise relationships and discrete variables. For any number of variable labels, we demonstrate that clamping and summing approximate sub-partition functions can lead only to a decrease in the partition function estimate for TRW, and an increase for the naive mean field method, in each case guaranteeing an improvement in the approximation and bound. We next focus on binary variables, add the Bethe approximation to consideration and examine ways to choose good variables to clamp, introducing new methods. We show the importance of identifying highly frustrated cycles, and of checking the singleton entropy of a variable. We explore the value of our methods by empirical analysis and draw lessons to guide practitioners.

#### Tightness of LP Relaxations for Almost Balanced Models

Adrian Weller, Mark Rowland, David Sontag, May 2016. (In 19th International Conference on Artificial Intelligence and Statistics). Cadiz, Spain.

Linear programming (LP) relaxations are widely used to attempt to identify a most likely configuration of a discrete graphical model. In some cases, the LP relaxation attains an optimum vertex at an integral location and thus guarantees an exact solution to the original optimization problem. When this occurs, we say that the LP relaxation is tight. Here we consider binary pairwise models and derive sufficient conditions for guaranteed tightness of (i) the standard LP relaxation on the local polytope LP+LOC, and (ii) the LP relaxation on the triplet-consistent polytope LP+TRI (the next level in the Sherali-Adams hierarchy). We provide simple new proofs of earlier results and derive significant novel results including that LP+TRI is tight for any model where each block is balanced or almost balanced, and a decomposition theorem that may be used to break apart complex models into smaller pieces. An almost balanced (sub-)model is one that contains no frustrated cycles except through one privileged variable.
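
The definition of an almost balanced model can be illustrated with a small check on the signs of the edges. The sketch below is not taken from the paper: it assumes the usual convention that a frustrated cycle is one with an odd number of repulsive edges, looks only at edge signs (singleton potentials are ignored), and uses hypothetical function names. It relies on the standard fact that a signed graph has no frustrated cycle exactly when its nodes can be split into two sides with attractive edges inside each side and repulsive edges across.

```python
from collections import deque


def is_balanced(num_vars, signed_edges):
    """Return True if the signed graph contains no frustrated cycle.

    signed_edges holds (i, j, sign) triples, where sign is +1 for an
    attractive edge and -1 for a repulsive edge (assumed convention).
    """
    adjacency = [[] for _ in range(num_vars)]
    for i, j, sign in signed_edges:
        adjacency[i].append((j, sign))
        adjacency[j].append((i, sign))

    side = [None] * num_vars  # +1 or -1 once a variable has been visited
    for start in range(num_vars):
        if side[start] is not None:
            continue
        side[start] = 1
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v, sign in adjacency[u]:
                expected = side[u] * sign  # same side if attractive, else opposite
                if side[v] is None:
                    side[v] = expected
                    queue.append(v)
                elif side[v] != expected:
                    return False  # an odd number of repulsive edges closes a cycle
    return True


def balanced_after_removing_one(num_vars, signed_edges):
    """Return True if deleting some single variable leaves a balanced graph.

    This mirrors "no frustrated cycles except through one privileged
    variable"; a graph that is already balanced also returns True.
    """
    edges = list(signed_edges)
    for removed in range(num_vars):
        remaining = [(i, j, s) for i, j, s in edges if removed not in (i, j)]
        if is_balanced(num_vars, remaining):
            return True
    return False
```

For instance, a triangle with three repulsive edges is not balanced, but removing any one variable leaves a single edge, so it passes the second check. The brute-force loop over candidate variables is chosen for clarity rather than speed.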

## 2015

#### Bayesian Lipschitz Constant Estimation and Quadrature

Jan-Peter Calliess, December 2015. (In Workshop on Probabilistic Integration, NIPS). Montreal, Canada.

Lipschitz quadrature methods provide an approach to one-dimensional numerical integration on bounded domains. On the basis of the assumption that the integrand is Lipschitz continuous with a known Lipschitz constant, these quadrature rules can provide a tight error bound around their integral estimates and utilise the Lipschitz constant to guide exploration in the context of adaptive quadrature. In this paper, we outline our ongoing work on extending this approach to settings where the Lipschitz constant is probabilistically uncertain. As the key component, we introduce a Bayesian approach for updating a subjectively probabilistic belief of the Lipschitz constant. Combined with any Lipschitz quadrature rule, we obtain an approach for translating a sample into an integral estimate with probabilistic uncertainty intervals. The paper concludes with an illustration of the approach followed by a discussion of open issues and future work.

#### Effective implementation of Gaussian process regression for machine learning

Alex Davies, 2015. University of Cambridge, Department of Engineering, Cambridge, UK.

This thesis presents frameworks for the effective implementation of Gaussian process regression for machine learning. It addresses this in three parts: effective iterative methods for calculating the predictive distribution and derivatives of a Gaussian process with fixed hyper-parameters; three broad classes of kernels of controllable complexity that allow for an order-of-magnitude scaling in the previous framework; and an investigation into alternative objective functions and improved derivatives for the optimization of model hyper-parameters.

#### Gaussian Processes for Data-Efficient Learning in Robotics and Control

Marc Peter Deisenroth, Dieter Fox, Carl Edward Rasmussen, 2015. (IEEE Transactions on Pattern Analysis and Machine Intelligence). **DOI**: 10.1109/TPAMI.2013.218.

Autonomous learning has been a promising direction in control and robotics for more than a decade, since data-driven learning reduces the amount of engineering knowledge that is otherwise required. However, autonomous reinforcement learning (RL) approaches typically require many interactions with the system to learn controllers, which is a practical limitation in real systems, such as robots, where many interactions can be impractical and time consuming. To address this problem, current learning approaches typically require task-specific knowledge in the form of expert demonstrations, realistic simulators, pre-shaped policies, or specific knowledge about the underlying dynamics. In this article, we follow a different approach and speed up learning by extracting more information from data. In particular, we learn a probabilistic, non-parametric Gaussian process transition model of the system. By explicitly incorporating model uncertainty into long-term planning and controller learning, our approach reduces the effects of model errors, a key problem in model-based learning. Compared to state-of-the-art RL, our model-based policy search method achieves an unprecedented speed of learning. We demonstrate its applicability to autonomous learning in real robot and control tasks.

#### Training generative neural networks via Maximum Mean Discrepancy optimization

Gintare Karolina Dziugaite, Daniel M. Roy, Zoubin Ghahramani, July 2015. (In 31st Conference on Uncertainty in Artificial Intelligence). Amsterdam, The Netherlands.

We consider training a deep neural network to generate samples from an unknown distribution given i.i.d. data. We frame learning as an optimization minimizing a two-sample test statistic—informally speaking, a good generator network produces samples that cause a two-sample test to fail to reject the null hypothesis. As our two-sample test statistic, we use an unbiased estimate of the maximum mean discrepancy, which is the centerpiece of the nonparametric kernel two-sample test proposed by Gretton et al. (2012). We compare to the adversarial nets framework introduced by Goodfellow et al. (2014), in which learning is a two-player game between a generator network and an adversarial discriminator network, both trained to outwit the other. From this perspective, the MMD statistic plays the role of the discriminator. In addition to empirical comparisons, we prove bounds on the generalization error incurred by optimizing the empirical MMD.
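
As a rough illustration of the statistic this abstract refers to, the following sketch computes the standard unbiased estimate of the squared maximum mean discrepancy between two samples. It is not the authors' code: the Gaussian kernel, the `bandwidth` default, and the function names are assumptions made for the example, and the pure-Python loops favour readability over speed.

```python
import math


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel between two points given as sequences of floats."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd2_unbiased(xs, ys, bandwidth=1.0):
    """Unbiased estimate of squared MMD between samples xs and ys.

    Within-sample sums skip the i == j terms, which is what makes the
    estimate unbiased; it can therefore come out slightly negative.
    """
    m, n = len(xs), len(ys)
    sum_xx = sum(
        gaussian_kernel(xs[i], xs[j], bandwidth)
        for i in range(m) for j in range(m) if i != j
    )
    sum_yy = sum(
        gaussian_kernel(ys[i], ys[j], bandwidth)
        for i in range(n) for j in range(n) if i != j
    )
    sum_xy = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
    return (
        sum_xx / (m * (m - 1))
        + sum_yy / (n * (n - 1))
        - 2.0 * sum_xy / (m * n)
    )
```

In the training setup described above, `xs` would be generator samples and `ys` data samples, and the generator would be adjusted to drive this quantity towards zero; that optimization loop is outside the scope of this sketch.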

#### On a class of sigma-Stable Poisson-Kingman models and an effective marginalised sampler

Stefano Favaro, Maria Lomeli, Yee Whye Teh, 2015. (Statistics and Computing).

We investigate the use of a large class of discrete random probability measures, which is referred to as the class Q, in the context of Bayesian nonparametric mixture modeling. The class Q encompasses both the two-parameter Poisson-Dirichlet process and the normalized generalized Gamma process, thus allowing us to comparatively study the inferential advantages of these two well-known nonparametric priors. Apart from a highly flexible parameterization, the distinguishing feature of the class Q is the availability of a tractable posterior distribution. This feature, in turn, lets us derive an efficient marginal MCMC algorithm for posterior sampling within the framework of mixture models. We demonstrate the efficacy of our modeling framework on both one-dimensional and multi-dimensional datasets.

#### Bayesian Time Series Learning with Gaussian Processes

Roger Frigola, 2015. University of Cambridge, Department of Engineering, Cambridge, UK.

The analysis of time series data is important in fields as disparate as the social sciences, biology, engineering or econometrics. In this dissertation, we present a number of algorithms designed to learn Bayesian nonparametric models of time series. The goal of these kinds of models is twofold. First, they aim at making predictions which quantify the uncertainty due to limitations in the quantity and the quality of the data. Second, they are flexible enough to model highly complex data whilst preventing overfitting when the data does not warrant complex models. We begin with a unifying literature review on time series models based on Gaussian processes. Then, we centre our attention on the Gaussian Process State-Space Model (GP-SSM): a Bayesian nonparametric generalisation of discrete-time nonlinear state-space models. We present a novel formulation of the GP-SSM that offers new insights into its properties. We then proceed to exploit those insights by developing new learning algorithms for the GP-SSM based on particle Markov chain Monte Carlo and variational inference. Finally, we present a filtered nonlinear auto-regressive model with a simple, robust and fast learning algorithm that makes it well suited to its application by non-experts on large datasets. Its main advantage is that it avoids the computationally expensive (and potentially difficult to tune) smoothing step that is a key part of learning nonlinear state-space models.

#### Latent Gaussian Processes for Distribution Estimation of Multivariate Categorical Data

Yarin Gal, Yutian Chen, Zoubin Ghahramani, 2015. (In Proceedings of the 32nd International Conference on Machine Learning (ICML-15)).

Multivariate categorical data occur in many applications of machine learning. One of the main difficulties with these vectors of categorical variables is sparsity. The number of possible observations grows exponentially with vector length, but dataset diversity might be poor in comparison. Recent models have gained significant improvement in supervised tasks with this data. These models embed observations in a continuous space to capture similarities between them. Building on these ideas we propose a Bayesian model for the unsupervised task of distribution estimation of multivariate categorical data. We model vectors of categorical variables as generated from a non-linear transformation of a continuous latent space. Non-linearity captures multi-modality in the distribution. The continuous representation addresses sparsity. Our model ties together many existing models, linking the linear categorical latent Gaussian model, the Gaussian process latent variable model, and Gaussian process classification. We derive inference for our model based on recent developments in sampling based variational inference. We show empirically that the model outperforms its linear and discrete counterparts in imputation tasks of sparse data.

#### Improving the Gaussian Process Sparse Spectrum Approximation by Representing Uncertainty in Frequency Inputs

Yarin Gal, Richard Turner, 2015. (In Proceedings of the 32nd International Conference on Machine Learning (ICML-15)).

Standard sparse pseudo-input approximations to the Gaussian process (GP) cannot handle complex functions well. Sparse spectrum alternatives attempt to answer this but are known to over-fit. We suggest the use of variational inference for the sparse spectrum approximation to avoid both issues. We model the covariance function with a finite Fourier series approximation and treat it as a random variable. The random covariance function has a posterior, on which a variational distribution is placed. The variational distribution transforms the random covariance function to fit the data. We study the properties of our approximate inference, compare it to alternative ones, and extend it to the distributed and stochastic domains. Our approximation captures complex functions better than standard approaches and avoids over-fitting.

#### Probabilistic machine learning and artificial intelligence

Zoubin Ghahramani, 2015. (Nature). **DOI**: 10.1038/nature14541.

How can a machine learn from experience? Probabilistic modelling provides a framework for understanding what learning is, and has therefore emerged as one of the principal theoretical and practical approaches for designing machines that learn from data acquired through experience. The probabilistic framework, which describes how to represent and manipulate uncertainty about models and predictions, has a central role in scientific data analysis, machine learning, robotics, cognitive science and artificial intelligence. This Review provides an introduction to this framework, and discusses some of the state-of-the-art advances in the field, namely, probabilistic programming, Bayesian optimization, data compression and automatic model discovery.

#### Scaling Multidimensional Inference for Structured Gaussian Processes

E. Gilboa, Yunus Saatçi, John P. Cunningham, 2015. (IEEE Transactions on Pattern Analysis and Machine Intelligence). **DOI**: 10.1109/TPAMI.2013.192.

Exact Gaussian process (GP) regression has O(N^3) runtime for data size N, making it intractable for large N. Many algorithms for improving GP scaling approximate the covariance with lower rank matrices. Other work has exploited structure inherent in particular covariance functions, including GPs with implied Markov structure, and inputs on a lattice (both enable O(N) or O(N log N) runtime). However, these GP advances have not been well extended to the multidimensional input setting, despite the preponderance of multidimensional applications. This paper introduces and tests three novel extensions of structured GPs to multidimensional inputs, for models with additive and multiplicative kernels. First we present a new method for inference in additive GPs, showing a novel connection between the classic backfitting method and the Bayesian framework. We extend this model using two advances: a variant of projection pursuit regression, and a Laplace approximation for non-Gaussian observations. Lastly, for multiplicative kernel structure, we present a novel method for GPs with inputs on a multidimensional grid. We illustrate the power of these three advances on several data sets, achieving performance equal to or very close to the naive GP at orders of magnitude less cost.

**Comment:** arXiv

#### Neural Adaptive Sequential Monte Carlo

Shixiang Gu, Zoubin Ghahramani, Richard E. Turner, Dec 2015. (In Advances in Neural Information Processing Systems 28). Montréal, Canada.

Sequential Monte Carlo (SMC), or particle filtering, is a popular class of methods for sampling from an intractable target distribution using a sequence of simpler intermediate distributions. Like other importance sampling-based methods, performance is critically dependent on the proposal distribution: a bad proposal can lead to arbitrarily inaccurate estimates of the target distribution. This paper presents a new method for automatically adapting the proposal using an approximation of the Kullback-Leibler divergence between the true posterior and the proposal distribution. The method is very flexible, applicable to any parameterised proposal distribution, and it supports online and batch variants. We use the new framework to adapt powerful proposal distributions with rich parameterisations based upon neural networks, leading to Neural Adaptive Sequential Monte Carlo (NASMC). Experiments indicate that NASMC significantly improves inference in a non-linear state space model, outperforming adaptive proposal methods including the Extended Kalman and Unscented Particle Filters. Experiments also indicate that improved inference translates into improved parameter learning when NASMC is used as a subroutine of Particle Marginal Metropolis Hastings. Finally, we show that NASMC is able to train a neural network-based deep recurrent generative model, achieving results that compete with the state-of-the-art for polyphonic music modelling. NASMC can be seen as bridging the gap between adaptive SMC methods and the recent work in scalable, black-box variational inference.

#### MCMC for Variationally Sparse Gaussian Processes

James Hensman, Alexander G D G Matthews, Maurizio Filippone, Zoubin Ghahramani, December 2015. (In Advances in Neural Information Processing Systems 28). Montreal, Canada.

Gaussian process (GP) models form a core part of probabilistic machine learning. Considerable research effort has been made into attacking three issues with GP models: how to compute efficiently when the number of data is large; how to approximate the posterior when the likelihood is not Gaussian and how to estimate covariance function parameter posteriors. This paper simultaneously addresses these, using a variational approximation to the posterior which is sparse in support of the function but otherwise free-form. The result is a Hybrid Monte-Carlo sampling scheme which allows for a non-Gaussian approximation over the function values and covariance parameters simultaneously, with efficient computations based on inducing-point sparse GPs. Code to replicate each experiment in this paper will be available shortly.

#### Scalable Variational Gaussian Process Classification

James Hensman, Alexander G D G Matthews, Zoubin Ghahramani, May 2015. (In 18th International Conference on Artificial Intelligence and Statistics). San Diego, California, USA.

Gaussian process classification is a popular method with a number of appealing properties. We show how to scale the model within a variational inducing point framework, out-performing the state of the art on benchmark datasets. Importantly, the variational formulation can be exploited to allow classification in problems with millions of data points, as we demonstrate in experiments.

#### Predictive Entropy Search for Bayesian Optimization with Unknown Constraints

José Miguel Hernández-Lobato, Michael A. Gelbart, Matthew W. Hoffman, Ryan P. Adams, Zoubin Ghahramani, 2015. (In 32nd International Conference on Machine Learning).

Unknown constraints arise in many types of expensive black-box optimization problems. Several methods have been proposed recently for performing Bayesian optimization with constraints, based on the expected improvement (EI) heuristic. However, EI can lead to pathologies when used with constraints. For example, in the case of decoupled constraints—i.e., when one can independently evaluate the objective or the constraints—EI can encounter a pathology that prevents exploration. Additionally, computing EI requires a current best solution, which may not exist if none of the data collected so far satisfy the constraints. By contrast, information-based approaches do not suffer from these failure modes. In this paper, we present a new information-based method called Predictive Entropy Search with Constraints (PESC). We analyze the performance of PESC and show that it compares favorably to EI-based approaches on synthetic and benchmark problems, as well as several real-world examples. We demonstrate that PESC is an effective algorithm that provides a promising direction towards a unified solution for constrained Bayesian optimization.

#### Unsupervised Many-to-Many Object Matching for Relational Data

Tomoharu Iwata, James Robert Lloyd, Zoubin Ghahramani, 2015. (IEEE Transactions on Pattern Analysis and Machine Intelligence).

We propose a method for unsupervised many-to-many object matching from multiple networks, which is the task of finding correspondences between groups of nodes in different networks. For example, the proposed method can discover shared word groups from multi-lingual document-word networks without cross-language alignment information. We assume that multiple networks share groups, and each group has its own interaction pattern with other groups. Using infinite relational models with this assumption, objects in different networks are clustered into common groups depending on their interaction patterns, discovering a matching. The effectiveness of the proposed method is experimentally demonstrated by using synthetic and real relational data sets, which include applications to cross-domain recommendation without shared user/item identifiers and multi-lingual word clustering.

#### Stochastic Expectation Propagation

Yingzhen Li, José Miguel Hernández-Lobato, Richard E. Turner, Dec 2015. (In Advances in Neural Information Processing Systems 28). Montréal, Canada.

Expectation propagation (EP) is a deterministic approximation algorithm that is often used to perform approximate Bayesian parameter learning. EP approximates the full intractable posterior distribution through a set of local approximations that are iteratively refined for each datapoint. EP can offer analytic and computational advantages over other approximations, such as Variational Inference (VI), and is the method of choice for a number of models. The local nature of EP appears to make it an ideal candidate for performing Bayesian learning on large models in large-scale dataset settings. However, EP has a crucial limitation in this context: the number of approximating factors needs to increase with the number of data-points, N, which often entails a prohibitively large memory overhead. This paper presents an extension to EP, called stochastic expectation propagation (SEP), that maintains a global posterior approximation (like VI) but updates it in a local way (like EP). Experiments on a number of canonical learning problems using synthetic and real-world datasets indicate that SEP performs almost as well as full EP, but reduces the memory consumption by a factor of N. SEP is therefore ideally suited to performing approximate Bayesian learning in the large model, large dataset setting.

#### Representation, learning, description and criticism of probabilistic models with applications to networks, functions and relational data

James Robert Lloyd, 2015. University of Cambridge, Department of Engineering, Cambridge, UK.

This thesis makes contributions to a variety of aspects of probabilistic inference. When performing probabilistic inference, one must first represent one’s beliefs with a probability distribution. Specifying the details of a probability distribution can be a difficult task in many situations, but when expressing beliefs about complex data structures it may not even be apparent what form such a distribution should take. This thesis starts by demonstrating how representation theorems due to Aldous, Hoover and Kallenberg can be used to specify appropriate models for data in the form of networks. These theorems are then extended in order to reveal appropriate probability distributions for arbitrary relational data or databases. A simpler data structure to specify probability distributions for is that of functions; many probability distributions for functions have been used for centuries. We demonstrate that many of these distributions can be expressed in a common language of Gaussian process kernels constructed from a few base elements and operators. The structure of this language allows for the effective automatic construction of probabilistic models for functions. Furthermore, the formal mathematical language of kernels can be mapped neatly onto natural language allowing for automatic descriptions of the automatically constructed models. By further automating the construction of statistical models, the need to be able to effectively check or criticise these models becomes greater. This thesis demonstrates how kernel two sample tests can be used to demonstrate where a probabilistic model most disagrees with data allowing for targeted improvements to the model. In proposing a new method of model criticism this thesis also briefly discusses the philosophy of model criticism within the context of probabilistic inference.

#### Statistical Model Criticism using Kernel Two Sample Tests

James Robert Lloyd, Zoubin Ghahramani, December 2015. (In Advances in Neural Information Processing Systems 28). Montreal, Canada.

We propose an exploratory approach to statistical model criticism using maximum mean discrepancy (MMD) two sample tests. Typical approaches to model criticism require a practitioner to select a statistic by which to measure discrepancies between data and a statistical model. MMD two sample tests are instead constructed as an analytic maximisation over a large space of possible statistics and therefore automatically select the statistic which most shows any discrepancy. We demonstrate on synthetic data that the selected statistic, called the witness function, can be used to identify where a statistical model most misrepresents the data it was trained on. We then apply the procedure to real data where the models being assessed are restricted Boltzmann machines, deep belief networks and Gaussian process regression and demonstrate the ways in which these models fail to capture the properties of the data they are trained on.

#### A hybrid sampler for Poisson-Kingman mixture models

Maria Lomeli, Stefano Favaro, Yee Whye Teh, December 2015. (In Advances in Neural Information Processing Systems 28). Montreal, Canada.

This paper introduces a new Markov Chain Monte Carlo scheme for posterior sampling in Bayesian nonparametric mixture models with priors that belong to the general Poisson-Kingman class. We present a novel and compact way of representing the infinite dimensional component of the model such that, while explicitly representing this infinite component, it has lower memory and storage requirements than previous MCMC schemes. We describe comparative simulation results demonstrating the efficacy of the proposed MCMC algorithm against existing marginal and conditional MCMC samplers.

#### A causal perspective on domain adaptation

Mateo Rojas-Carulla, Bernhard Schölkopf, Richard Turner, Jonas Peters, 2015. (arXiv preprint arXiv:1507.05333).

From training data from several related domains (or tasks), methods of domain adaptation try to combine knowledge to improve performance. This paper discusses an approach to domain adaptation which is inspired by a causal interpretation of the multi-task problem. We assume that a covariate shift assumption holds true for a subset of predictor variables: the conditional of the target variable given this subset of predictors is invariant with respect to shifts in those predictors (covariates). We propose to learn the corresponding conditional expectation in the training domains and use it for estimation in the target domain. We further introduce a method which allows for automatic inference of the above subset in regression and classification. We study the performance of this approach in an adversarial setting, in the case where no additional examples are available in the test domain. If a labeled sample is available, we provide a method for using both the transferred invariant conditional and task specific information. We present results on synthetic data sets and a sentiment analysis problem.

#### Practical Probabilistic Programming with Monads

Adam Ścibior, Zoubin Ghahramani, Andrew D. Gordon, 2015. (In Proceedings of the 8th ACM SIGPLAN Symposium on Haskell). Association for Computing Machinery. **DOI**: 10.1145/2804302.2804317.

The machine learning community has recently shown a lot of interest in practical probabilistic programming systems that target the problem of Bayesian inference. Such systems come in different forms, but they all express probabilistic models as computational processes using syntax resembling programming languages. In the functional programming community monads are known to offer a convenient and elegant abstraction for programming with probability distributions, but their use is often limited to very simple inference problems. We show that it is possible to use the monad abstraction to construct probabilistic models for machine learning, while still offering good performance of inference in challenging models. We use a GADT as an underlying representation of a probability distribution and apply Sequential Monte Carlo-based methods to achieve efficient inference. We define a formal semantics via measure theory. We demonstrate a clean and elegant implementation that achieves performance comparable with Anglican, a state-of-the-art probabilistic programming system.

#### Compressing Sets and Multisets of Sequences

Christian Steinruecken, March 2015. (IEEE Transactions on Information Theory). IEEE. **DOI**: 10.1109/TIT.2015.2392093. **ISSN**: 0018-9448. **Note**: A previous version was published at the Data Compression Conference 2014.

This article describes lossless compression algorithms for multisets of sequences, taking advantage of the multiset’s unordered structure. Multisets are a generalisation of sets where members are allowed to occur multiple times. A multiset can be encoded naively by simply storing its elements in some sequential order, but then information is wasted on the ordering. We propose a technique that transforms the multiset into an order-invariant tree representation, and derive an arithmetic code that optimally compresses the tree. Our method achieves compression even if the sequences in the multiset are individually incompressible (such as cryptographic hash sums). The algorithm is demonstrated practically by compressing collections of SHA-1 hash sums, and multisets of arbitrary, individually encodable objects.
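
To give a feel for how much information a naive sequential encoding spends on the ordering, the sketch below counts the distinct orderings of a multiset and reports the base-2 logarithm of that count. This is only a back-of-the-envelope bound, not the article's tree-based arithmetic code, and the function name `ordering_bits` is made up for the example. It assumes every ordering is equally likely under the naive encoding.

```python
import math
from collections import Counter


def ordering_bits(items):
    """Bits needed to single out one ordering of the multiset `items`.

    A multiset with n members and multiplicities c_1, ..., c_k has
    n! / (c_1! * ... * c_k!) distinct orderings; the result is log2 of that.
    lgamma(k + 1) equals ln(k!), which avoids building huge factorials.
    """
    counts = Counter(items)
    ln_orderings = math.lgamma(len(items) + 1) - sum(
        math.lgamma(c + 1) for c in counts.values()
    )
    return ln_orderings / math.log(2)
```

Called on a list of 1,000 distinct hash sums, this gives a figure in the region of 8,500 bits, which is the most an order-invariant code could hope to save over storing the same elements in sequence. Repeated elements lower the figure, and a multiset consisting of one element repeated has no ordering information at all.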

#### Improving PPM with dynamic parameter updates

Christian Steinruecken, Zoubin Ghahramani, David MacKay, April 2015. (In Proceedings of the Data Compression Conference). Edited by Ali Bilgin, Michael W. Marcellin, Joan Serra-Sagristà, James A. Storer. Snowbird, UT, USA. IEEE Computer Society. **ISSN**: 1068-0314.

This article makes several improvements to the classic PPM algorithm, resulting in a new algorithm with superior compression effectiveness on human text. The key differences of our algorithm to classic PPM are that (A) rather than the original escape mechanism, we use a generalised blending method with explicit hyper-parameters that control the way symbol counts are combined to form predictions; (B) different hyper-parameters are used for classes of different contexts; and (C) these hyper-parameters are updated dynamically using gradient information. The resulting algorithm (PPM-DP) compresses human text better than all currently published variants of PPM, CTW, DMC, LZ, CSE and BWT, with runtime only slightly slower than classic PPM.

#### Learning Stationary Time Series using Gaussian Process with Nonparametric Kernels

Felipe Tobar, Thang D. Bui, Richard E. Turner, Dec 2015. (In Advances in Neural Information Processing Systems 28). Montréal, Canada.

We introduce the Gaussian Process Convolution Model (GPCM), a two-stage nonparametric generative procedure to model stationary signals as the convolution between a continuous-time white-noise process and a continuous-time linear filter drawn from a Gaussian process. The GPCM is a continuous-time nonparametric-window moving average process and, conditionally, is itself a Gaussian process with a nonparametric kernel defined in a probabilistic fashion. The generative model can be equivalently considered in the frequency domain, where the power spectral density of the signal is specified using a Gaussian process. One of the main contributions of the paper is to develop a novel variational free-energy approach based on inter-domain inducing variables that efficiently learns the continuous-time linear filter and infers the driving white-noise process. In turn, this scheme provides closed-form probabilistic estimates of the covariance kernel and the noise-free signal both in denoising and prediction scenarios. Additionally, the variational inference procedure provides closed-form expressions for the approximate posterior of the spectral density given the observed data, leading to new Bayesian nonparametric approaches to spectrum estimation. The proposed GPCM is validated using synthetic and real-world signals.

#### Unsupervised State-Space Modeling Using Reproducing Kernels

Felipe Tobar, Petar M. Djurić, Danilo P. Mandic, 2015. (IEEE Transactions on Signal Processing).

A novel framework for the design of state-space models (SSMs) is proposed whereby the state-transition function of the model is parametrized using reproducing kernels. The nature of SSMs requires learning a latent function that resides in the state space and for which input-output sample pairs are not available, thus prohibiting the use of gradient-based supervised kernel learning. To this end, we then propose to learn the mixing weights of the kernel estimate by sampling from their posterior density using Monte Carlo methods. We first introduce an offline version of the proposed algorithm, followed by an online version which performs inference on both the parameters and the hidden state through particle filtering. The accuracy of the estimation of the state-transition function is first validated on synthetic data. Next, we show that the proposed algorithm outperforms kernel adaptive filters in the prediction of real-world time series, while also providing probabilistic estimates, a key advantage over standard methods.

#### High-Dimensional Kernel Regression: A Guide for Practitioners

Felipe Tobar, Danilo P. Mandic, 2015. (In Trends in Digital Signal Processing: A Festschrift in Honour of A.G. Constantinides). Edited by Y. C. Lim, H. K. Kwan, W.-C. Siu. CRC Press.

#### Design of Positive-Definite Quaternion Kernels

Felipe Tobar, Danilo P. Mandic, 2015. (IEEE Signal Processing Letters).

Quaternion reproducing kernel Hilbert spaces (QRKHS) have been proposed recently and provide a high-dimensional feature space (alternative to the real-valued multikernel approach) for general kernel-learning applications. The current challenge within quaternion-kernel learning is the lack of general quaternion-valued kernels, which are necessary to exploit the full advantages of the QRKHS theory in real-world problems. This letter proposes a novel way to design quaternion-valued kernels; this is achieved by transforming three complex kernels into quaternion ones and then combining their real and imaginary parts. Building on this general construction, our emphasis is on a new quaternion kernel of polynomial features, which is assessed in the prediction of body sensor network applications.

#### Modelling of Complex Signals using Gaussian Processes

Felipe Tobar, Richard E. Turner, 2015. (In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)).

In complex-valued signal processing, estimation algorithms require complete knowledge (or accurate estimation) of the second-order statistics; this makes Gaussian processes (GPs) well suited for modelling complex signals, as they are designed in terms of covariance functions. Dealing with bivariate signals using GPs requires four covariance matrices, or equivalently, two complex matrices. We propose a GP-based approach for modelling complex signals, whereby the second-order statistics are learnt through maximum likelihood; in particular, the complex GP approach allows for circularity coefficient estimation in a robust manner when the observed signal is corrupted by (circular) white noise. The proposed model is validated using climate signals, for both circular and noncircular cases. The results obtained open new possibilities for collaboration between the complex signal processing and Gaussian processes communities towards an appealing representation and statistical description of bivariate signals.

#### Revisiting the limits of MAP inference by MWSS on perfect graphs

Adrian Weller, May 2015. (In 18th International Conference on Artificial Intelligence and Statistics). San Diego, California.

A recent, promising approach to identifying a configuration of a discrete graphical model with highest probability (termed MAP inference) is to reduce the problem to finding a maximum weight stable set (MWSS) in a derived weighted graph, which, if perfect, allows a solution to be found in polynomial time. Weller and Jebara (2013) investigated the class of binary pairwise models where this method may be applied. However, their analysis made a seemingly innocuous assumption which simplifies analysis but led to only a subset of possible reparameterizations being considered. Here we introduce novel techniques and consider all cases, demonstrating that this greatly expands the set of tractable models. We provide a simple, exact characterization of the new, enlarged set and show how such models may be efficiently identified, thus settling the power of the approach on this class.

**Comment:** Also accepted for presentation at the 21st International Conference on Principles and Practice of Constraint Programming (CP 2015)

#### Distributed Inference for Dirichlet Process Mixture Models

Hong Ge, Yutian Chen, Moquan Wan, Zoubin Ghahramani, 07–09 Jul 2015. (In Proceedings of the 32nd International Conference on Machine Learning). Edited by Francis Bach, David Blei. Lille, France. PMLR. Proceedings of Machine Learning Research.

Bayesian nonparametric mixture models based on the Dirichlet process (DP) have been widely used for solving problems like clustering, density estimation and topic modelling. These models make weak assumptions about the underlying process that generated the observed data. Thus, when more data are collected, the complexity of these models can change accordingly. These theoretical properties often lead to superior predictive performance when compared to traditional finite mixture models. However, despite the increasing amount of data available, the application of Bayesian nonparametric mixture models is so far limited to relatively small data sets. In this paper, we propose an efficient distributed inference algorithm for the DP and the HDP mixture model. The proposed method is based on a variant of the slice sampler for DPs. Since this sampler does not involve a pre-determined truncation, the stationary distribution of the sampling algorithm is unbiased. We provide both local thread-level and distributed machine-level parallel implementations and study the performance of this sampler through an extensive set of experiments on image and text data. When compared to existing inference algorithms, the proposed method exhibits state-of-the-art accuracy and strong scalability with up to 512 cores.

## 2014

#### Policy search for learning robot control using sparse data

B. Bischoff, D. Nguyen-Tuong, D. van Hoof, A. McHutchon, Carl Edward Rasmussen, A. Knoll, M. P. Deisenroth, 2014. (In IEEE International Conference on Robotics and Automation). Hong Kong, China. IEEE. **DOI**: 10.1109/ICRA.2014.6907422.

In many complex robot applications, such as grasping and manipulation, it is difficult to program desired task solutions beforehand, as robots are within an uncertain and dynamic environment. In such cases, learning tasks from experience can be a useful alternative. To achieve sound learning and generalization performance, machine learning, and reinforcement learning in particular, usually requires sufficient data. However, in cases where only a little data is available for learning, due to system constraints and practical issues, reinforcement learning can act suboptimally. In this paper, we investigate how model-based reinforcement learning, in particular the probabilistic inference for learning control method (PILCO), can be tailored to cope with the case of sparse data to speed up learning. The basic idea is to include further prior knowledge into the learning process. As PILCO is built on the probabilistic Gaussian processes framework, additional system knowledge can be incorporated by defining appropriate prior distributions, e.g. a linear mean Gaussian prior. The resulting PILCO formulation remains in closed form and analytically tractable. The proposed approach is evaluated in simulation as well as on a physical robot, the Festo Robotino XT. For the robot evaluation, we employ the approach for learning an object pick-up task. The results show that by including prior knowledge, policy learning can be sped up in the presence of sparse data.

#### Equilibrium simulations of proteins using molecular fragment replacement and NMR chemical shifts

Wouter Boomsma, Pengfei Tian, Jes Frellsen, Jesper Ferkinghoff-Borg, Thomas Hamelryck, Kresten Lindorff-Larsen, Michele Vendruscolo, 2014. (Proceedings of the National Academy of Sciences). **DOI**: 10.1073/pnas.1404948111.

Methods of protein structure determination based on NMR chemical shifts are becoming increasingly common. The most widely used approaches adopt the molecular fragment replacement strategy, in which structural fragments are repeatedly reassembled into different complete conformations in molecular simulations. Although these approaches are effective in generating individual structures consistent with the chemical shift data, they do not enable the sampling of the conformational space of proteins with correct statistical weights. Here, we present a method of molecular fragment replacement that makes it possible to perform equilibrium simulations of proteins, and hence to determine their free energy landscapes. This strategy is based on the encoding of the chemical shift information in a probabilistic model in Markov chain Monte Carlo simulations. First, we demonstrate that with this approach it is possible to fold proteins to their native states starting from extended structures. Second, we show that the method satisfies the detailed balance condition and hence it can be used to carry out an equilibrium sampling from the Boltzmann distribution corresponding to the force field used in the simulations. Third, by comparing the results of simulations carried out with and without chemical shift restraints we describe quantitatively the effects that these restraints have on the free energy landscapes of proteins. Taken together, these results demonstrate that the molecular fragment replacement strategy can be used in combination with chemical shift information to characterize not only the native structures of proteins but also their conformational fluctuations.

#### Scalable Gaussian Process Structured Prediction for Grid Factor Graph Applications

Sébastien Bratières, Novi Quadrianto, Sebastian Nowozin, Zoubin Ghahramani, 2014. (In 31st International Conference on Machine Learning).

Structured prediction is an important and well studied problem with many applications across machine learning. GPstruct is a recently proposed structured prediction model that offers appealing properties such as being kernelised, non-parametric, and supporting Bayesian inference (Bratières et al. 2013). The model places a Gaussian process prior over energy functions which describe relationships between input variables and structured output variables. However, the memory demand of GPstruct is quadratic in the number of latent variables and training runtime scales cubically. This prevents GPstruct from being applied to problems involving grid factor graphs, which are prevalent in computer vision and spatial statistics applications. Here we explore a scalable approach to learning GPstruct models based on ensemble learning, with weak learners (predictors) trained on subsets of the latent variables and bootstrap data, which can easily be distributed. We show experiments with 4M latent variables on image segmentation. Our method outperforms widely-used conditional random field models trained with pseudo-likelihood. Moreover, in image segmentation problems it improves over recent state-of-the-art marginal optimisation methods in terms of predictive performance and uncertainty calibration. Finally, it generalises well on all training set sizes.

#### Tree-structured Gaussian Process Approximations

Thang D. Bui, Richard E. Turner, 2014. (In Advances in Neural Information Processing Systems 27). Edited by Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, K.Q. Weinberger. Curran Associates, Inc.

Gaussian process regression can be accelerated by constructing a small pseudo-dataset to summarize the observed data. This idea sits at the heart of many approximation schemes, but such an approach requires the number of pseudo-datapoints to be scaled with the range of the input space if the accuracy of the approximation is to be maintained. This presents problems in time-series settings or in spatial datasets where large numbers of pseudo-datapoints are required since computation typically scales quadratically with the pseudo-dataset size. In this paper we devise an approximation whose complexity grows linearly with the number of pseudo-datapoints. This is achieved by imposing a tree or chain structure on the pseudo-datapoints and calibrating the approximation using a Kullback-Leibler (KL) minimization. Inference and learning can then be performed efficiently using the Gaussian belief propagation algorithm. We demonstrate the validity of our approach on a set of challenging regression tasks including missing data imputation for audio and spatial datasets. We trace out the speed-accuracy trade-off for the new method and show that the frontier dominates those obtained from a large number of existing approximation techniques.

#### The Random Forest Kernel and other kernels for big data from random partitions

Alex Davies, Zoubin Ghahramani, 2014. (arXiv).

We present Random Partition Kernels, a new class of kernels derived by demonstrating a natural connection between random partitions of objects and kernels between those objects. We show how the construction can be used to create kernels from methods that would not normally be viewed as random partitions, such as Random Forest. To demonstrate the potential of this method, we propose two new kernels, the Random Forest Kernel and the Fast Cluster Kernel, and show that these kernels consistently outperform standard kernels on problems involving real-world datasets. Finally, we show how the form of these kernels lends itself to a natural approximation that is appropriate for certain big data problems, allowing O(N) inference in methods such as Gaussian Processes, Support Vector Machines and Kernel PCA.
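
The connection between random partitions and kernels mentioned in this abstract can be sketched as follows: the kernel value for two objects is estimated as the fraction of sampled partitions in which they fall in the same cluster. This is only an illustration under assumptions. The partitioning scheme (assigning each point to the nearest of a few randomly chosen data points) and the name `random_partition_kernel` are hypothetical choices, not the paper's exact Random Forest Kernel or Fast Cluster Kernel.

```python
import numpy as np

def random_partition_kernel(X, n_partitions=200, n_centres=5, seed=0):
    """Estimate k(a, b) as the fraction of random partitions that put a and b together."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    K = np.zeros((n, n))
    for _ in range(n_partitions):
        # Assumed partitioning scheme: pick random data points as centres
        # and assign every point to its nearest centre.
        centres = X[rng.choice(n, size=n_centres, replace=False)]
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Indicator matrix of "same cluster" for this partition.
        K += labels[:, None] == labels[None, :]
    return K / n_partitions

X = np.random.default_rng(1).normal(size=(20, 2))
K = random_partition_kernel(X)
print(K.shape)
```

Each per-partition indicator matrix is block-structured and positive semidefinite, so their average is a valid kernel matrix with ones on the diagonal.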

#### Avoiding pathologies in very deep networks

David Duvenaud, Oren Rippel, Ryan P. Adams, Zoubin Ghahramani, April 2014. (In 17th International Conference on Artificial Intelligence and Statistics). Reykjavik, Iceland.

Choosing appropriate architectures and regularization strategies for deep networks is crucial to good predictive performance. To shed light on this problem, we analyze the analogous problem of constructing useful priors on compositions of functions. Specifically, we study the deep Gaussian process, a type of infinitely-wide, deep neural network. We show that in standard architectures, the representational capacity of the network tends to capture fewer degrees of freedom as the number of layers increases, retaining only a single degree of freedom in the limit. We propose an alternate network architecture which does not suffer from this pathology. We also examine deep covariance functions, obtained by composing infinitely many feature transforms. Lastly, we characterize the class of models obtained by performing dropout on Gaussian processes.

#### Stick-breaking representations of sigma-Stable Poisson-Kingman models

Stefano Favaro, Maria Lomeli, Bernardo Nipoti, Yee Whye Teh, 2014. (Electronic Journal of Statistics).

In this paper we investigate the stick-breaking representation for the class of sigma-Stable Poisson-Kingman models, also known as Gibbs-type random probability measures. This class includes as special cases most of the discrete priors commonly used in Bayesian nonparametrics, such as the two parameter Poisson-Dirichlet process and the normalized generalized Gamma process. Under the assumption sigma=u/v, for any coprime integers 1 <= u < v such that u/v < 1/2, we show that a sigma-stable Poisson-Kingman model admits an explicit stick-breaking representation in terms of random variables which are obtained by suitably transforming Gamma random variables and products of independent Beta and Gamma random variables.

#### Combining the multicanonical ensemble with generative probabilistic models of local biomolecular structure

Jes Frellsen, Thomas Hamelryck, Jesper Ferkinghoff-Borg, 2014. (In Proceedings of the 59th World Statistics Congress of the International Statistical Institute). Hong Kong.

Markov chain Monte Carlo is a powerful tool for sampling complex systems such as large biomolecular structures. However, the standard Metropolis-Hastings algorithm suffers from a number of deficiencies when applied to systems with rugged free-energy landscapes. Some of these deficiencies can be addressed with the multicanonical ensemble. In this paper we will present two strategies for applying the multicanonical ensemble to distributions constructed from generative probabilistic models of local biomolecular structure. In particular, we will describe how to use the multicanonical ensemble efficiently in conjunction with the reference ratio method.

#### Variational Gaussian Process State-Space Models

Roger Frigola, Yutian Chen, Carl Edward Rasmussen, 2014. (In Advances in Neural Information Processing Systems 27). Edited by Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, K.Q. Weinberger.

State-space models have been successfully used for more than fifty years in different areas of science and engineering. We present a procedure for efficient variational Bayesian learning of nonlinear state-space models based on sparse Gaussian processes. The result of learning is a tractable posterior over nonlinear dynamical systems. In comparison to conventional parametric models, we offer the possibility to straightforwardly trade off model capacity and computational cost whilst avoiding overfitting. Our main algorithm uses a hybrid inference approach combining variational Bayes and sequential Monte Carlo. We also present stochastic variational inference and online learning approaches for fast learning with long time series.

#### Identification of Gaussian Process State-Space Models with Particle Stochastic Approximation EM

Roger Frigola, Fredrik Lindsten, Thomas B. Schön, Carl Edward Rasmussen, 2014. (In Proceedings of the 19th World Congress of the International Federation of Automatic Control (IFAC)).

Gaussian process state-space models (GP-SSMs) are a very flexible family of models of nonlinear dynamical systems. They comprise a Bayesian nonparametric representation of the dynamics of the system and additional (hyper-)parameters governing the properties of this nonparametric representation. The Bayesian formalism enables systematic reasoning about the uncertainty in the system dynamics. We present an approach to maximum likelihood identification of the parameters in GP-SSMs, while retaining the full nonparametric description of the dynamics. The method is based on a stochastic approximation version of the EM algorithm that employs recent developments in particle Markov chain Monte Carlo for efficient identification.

#### Pitfalls in the use of Parallel Inference for the Dirichlet Process

Yarin Gal, Zoubin Ghahramani, 2014. (In Proceedings of the 31st International Conference on Machine Learning (ICML-14)).

Recent work done by Lovell, Adams, and Mansinghka (2012) and Williamson, Dubey, and Xing (2013) has suggested an alternative parametrisation for the Dirichlet process in order to derive non-approximate parallel MCMC inference for it – work which has been picked up and implemented in several different fields. In this paper we show that the approach suggested is impractical due to an extremely unbalanced distribution of the data. We characterise the requirements of efficient parallel inference for the Dirichlet process and show that the proposed inference fails most of these requirements (while approximate approaches often satisfy most of them). We present both theoretical and experimental evidence, analysing the load balance for the inference and showing that it is independent of the size of the dataset and the number of nodes available in the parallel implementation. We end with suggestions of alternative paths of research for efficient non-approximate parallel inference for the Dirichlet process.

#### Distributed Variational Inference in Sparse Gaussian Process Regression and Latent Variable Models

Yarin Gal, Mark van der Wilk, Carl Rasmussen, 2014. (In Advances in Neural Information Processing Systems 27). Edited by Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, K.Q. Weinberger. Curran Associates, Inc.

Gaussian processes (GPs) are a powerful tool for probabilistic inference over functions. They have been applied to both regression and non-linear dimensionality reduction, and offer desirable properties such as uncertainty estimates, robustness to over-fitting, and principled ways for tuning hyper-parameters. However the scalability of these models to big datasets remains an active topic of research. We introduce a novel re-parametrisation of variational inference for sparse GP regression and latent variable models that allows for an efficient distributed algorithm. This is done by exploiting the decoupling of the data given the inducing points to re-formulate the evidence lower bound in a Map-Reduce setting. We show that the inference scales well with data and computational resources, while preserving a balanced distribution of the load among the nodes. We further demonstrate the utility in scaling Gaussian processes to big data. We show that GP performance improves with increasing amounts of data in regression (on flight data with 2 million records) and latent variable modelling (on MNIST). The results show that GPs perform better than many common models often used for big data.

#### Gaussian Process Volatility Model

Yue Wu, José Miguel Hernández-Lobato, Zoubin Ghahramani, December 2014. (In Advances in Neural Information Processing Systems 27). Montreal, Canada.

The prediction of time-changing variances is an important task in the modeling of financial data. Standard econometric models are often limited as they assume rigid functional relationships for the evolution of the variance. Moreover, functional parameters are usually learned by maximum likelihood, which can lead to overfitting. To address these problems we introduce GP-Vol, a novel non-parametric model for time-changing variances based on Gaussian Processes. This new model can capture highly flexible functional relationships for the variances. Furthermore, we introduce a new online algorithm for fast inference in GP-Vol. This method is much faster than current offline inference procedures and it avoids overfitting problems by following a fully Bayesian approach. Experiments with financial data show that GP-Vol performs significantly better than current standard alternatives.

#### Beta diffusion trees

Creighton Heaukulani, David A. Knowles, Zoubin Ghahramani, June 2014. (In 31st International Conference on Machine Learning). Beijing, China.

We define the beta diffusion tree, a random tree structure with a set of leaves that defines a collection of overlapping subsets of objects, known as a feature allocation. The generative process for the tree is defined in terms of particles (representing the objects) diffusing in some continuous space, analogously to the Dirichlet and Pitman–Yor diffusion trees (Neal, 2003b; Knowles & Ghahramani, 2011), both of which define tree structures over clusters of the particles. With the beta diffusion tree, however, multiple copies of a particle may exist and diffuse to multiple locations in the continuous space, resulting in (a random number of) possibly overlapping clusters of the objects. We demonstrate how to build a hierarchically-clustered factor analysis model with the beta diffusion tree and how to perform inference over the random tree structures with a Markov chain Monte Carlo algorithm. We conclude with several numerical experiments on missing data problems with data sets of gene expression arrays, international development statistics, and intranational socioeconomic measurements.

#### Beta diffusion trees and hierarchical feature allocations

Creighton Heaukulani, David A. Knowles, Zoubin Ghahramani, August 2014. Dept. of Engineering, University of Cambridge.

We define the beta diffusion tree, a random tree structure with a set of leaves that defines a collection of overlapping subsets of objects, known as a feature allocation. A generative process for the tree structure is defined in terms of particles (representing the objects) diffusing in some continuous space, analogously to the Dirichlet diffusion tree (Neal, 2003b), which defines a tree structure over partitions (i.e., non-overlapping subsets) of the objects. Unlike in the Dirichlet diffusion tree, multiple copies of a particle may exist and diffuse along multiple branches in the beta diffusion tree, and an object may therefore belong to multiple subsets of particles. We demonstrate how to build a hierarchically-clustered factor analysis model with the beta diffusion tree and how to perform inference over the random tree structures with a Markov chain Monte Carlo algorithm. We conclude with several numerical experiments on missing data problems with data sets of gene expression microarrays, international development statistics, and intranational socioeconomic measurements.

#### The combinatorial structure of beta negative binomial processes

Creighton Heaukulani, Daniel M. Roy, March 2014. Dept. of Engineering, University of Cambridge.

We characterize the combinatorial structure of conditionally-i.i.d. sequences of negative binomial processes with a common beta process base measure. In Bayesian nonparametric applications, such processes have served as models for unknown multisets of a measurable space. Previous work has characterized random subsets arising from conditionally-i.i.d. sequences of Bernoulli processes with a common beta process base measure. In this case, the combinatorial structure is described by the Indian buffet process. Our results give a count analogue of the Indian buffet process, which we call a negative binomial Indian buffet process. As an intermediate step toward this goal, we provide constructions for the beta negative binomial process that avoid a representation of the underlying beta process base measure.

#### Predictive Entropy Search for Efficient Global Optimization of Black-box Functions

José Miguel Hernández-Lobato, Matthew W. Hoffman, Zoubin Ghahramani, December 2014. (In Advances in Neural Information Processing Systems 27). Montreal, Canada.

We propose a novel information-theoretic approach for Bayesian optimization called Predictive Entropy Search (PES). At each iteration, PES selects the next evaluation point that maximizes the expected information gained with respect to the global maximum. PES codifies this intractable acquisition function in terms of the expected reduction in the differential entropy of the predictive distribution. This reformulation allows PES to obtain approximations that are both more accurate and efficient than other alternatives such as Entropy Search (ES). Furthermore, PES can easily perform a fully Bayesian treatment of the model hyperparameters while ES cannot. We evaluate PES in both synthetic and real-world applications, including optimization problems in machine learning, finance, biotechnology, and robotics. We show that the increased accuracy of PES leads to significant gains in optimization performance.

#### Probabilistic Matrix Factorization with Non-random Missing Data

José Miguel Hernández-Lobato, Neil Houlsby, Zoubin Ghahramani, June 2014. (In 31st International Conference on Machine Learning). Beijing, China.

We propose a probabilistic matrix factorization model for collaborative filtering that learns from data that is missing not at random (MNAR). Matrix factorization models exhibit state-of-the-art predictive performance in collaborative filtering. However, these models usually assume that the data is missing at random (MAR), and this is rarely the case. For example, the data is not MAR if users rate items they like more than ones they dislike. When the MAR assumption is incorrect, inferences are biased and predictive performance can suffer. Therefore, we model both the generative process for the data and the missing data mechanism. By learning these two models jointly we obtain improved performance over state-of-the-art methods when predicting the ratings and when modeling the data observation process. We present the first viable MF model for MNAR data. Our results are promising and we expect that further research on MNAR models will yield large gains in collaborative filtering.
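
The bias that arises when the MAR assumption fails can be seen in a small simulation. The sketch below is not the paper's model: the observation mechanism, in which the chance of a rating being observed grows linearly with the rating, and the function name `simulate_mnar_bias` are hypothetical and chosen only to mirror the example of users rating items they like more often.

```python
import numpy as np

def simulate_mnar_bias(n_users=2000, n_items=50, seed=0):
    """Compare the true mean rating with the mean of ratings observed under MNAR."""
    rng = np.random.default_rng(seed)
    # Complete rating matrix: integers 1..5, uniformly distributed.
    ratings = rng.integers(1, 6, size=(n_users, n_items))
    # Assumed missing-data mechanism: higher ratings are more likely to be
    # observed (probability 0.05 for a rating of 1 up to 0.65 for a rating of 5).
    p_obs = 0.05 + 0.15 * (ratings - 1)
    observed = rng.random(ratings.shape) < p_obs
    true_mean = ratings.mean()
    observed_mean = ratings[observed].mean()
    return true_mean, observed_mean

true_mean, observed_mean = simulate_mnar_bias()
print(f"true mean: {true_mean:.2f}, mean of observed ratings: {observed_mean:.2f}")
```

Under this assumed mechanism the mean of the observed ratings is noticeably higher than the true mean, so any estimator that treats the observed entries as a random sample inherits that bias.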

#### Stochastic Inference for Scalable Probabilistic Modeling of Binary Matrices

José Miguel Hernández-Lobato, Neil Houlsby, Zoubin Ghahramani, June 2014. (In 31st International Conference on Machine Learning). Beijing, China.

Fully observed large binary matrices appear in a wide variety of contexts. To model them, probabilistic matrix factorization (PMF) methods are an attractive solution. However, current batch algorithms for PMF can be inefficient because they need to analyze the entire data matrix before producing any parameter updates. We derive an efficient stochastic inference algorithm for PMF models of fully observed binary matrices. Our method exhibits faster convergence rates than more expensive batch approaches and has better predictive performance than scalable alternatives. The proposed method includes new data subsampling strategies which produce large gains over standard uniform subsampling. We also address the task of automatically selecting the size of the minibatches of data used by our method. For this, we derive an algorithm that adjusts this hyper-parameter online.

#### On correlation and budget constraints in model-based bandit optimization with application to automatic machine learning

Matthew W Hoffman, Bobak Shahriari, Nando de Freitas, April 2014. (In 17th International Conference on Artificial Intelligence and Statistics). Reykjavik, Iceland.

We address the problem of finding the maximizer of a nonlinear function that can only be evaluated, subject to noise, at a finite number of query locations. Further, we will assume that there is a constraint on the total number of permitted function evaluations. We introduce a Bayesian approach for this problem and show that it empirically outperforms both the existing frequentist counterpart and other Bayesian optimization methods. The Bayesian approach places emphasis on detailed modelling, including the modelling of correlations among the arms. As a result, it can perform well in situations where the number of arms is much larger than the number of allowed function evaluations, whereas the frequentist counterpart is inapplicable. This feature enables us to develop and deploy practical applications, such as automatic machine learning toolboxes. The paper presents comprehensive comparisons of the proposed approach with many Bayesian and bandit optimization techniques, the first comparison of many of these methods in the literature.

#### A Scalable Gibbs Sampler for Probabilistic Entity Linking

Neil Houlsby, Massimiliano Ciaramita, 2014. (In 36th European Conference on Information Retrieval). Springer.

Entity linking involves labeling phrases in text with their referent entities, such as Wikipedia or Freebase entries. This task is challenging due to the large number of possible entities, in the millions, and heavy-tailed mention ambiguity. We formulate the problem in terms of probabilistic inference within a topic model, where each topic is associated with a Wikipedia article. To deal with the large number of topics we propose a novel efficient Gibbs sampling scheme which can also incorporate side information, such as the Wikipedia graph. This conceptually simple probabilistic approach achieves state-of-the-art performance in entity-linking on the Aida-CoNLL dataset.

#### Cold-start Active Learning with Robust Ordinal Matrix Factorization

Neil Houlsby, José Miguel Hernández-Lobato, Zoubin Ghahramani, June 2014. (In 31st International Conference on Machine Learning). Beijing, China.

We present a new matrix factorization model for rating data and a corresponding active learning strategy to address the cold-start problem. Cold-start is one of the most challenging tasks for recommender systems: what to recommend with new users or items for which one has little or no data. An approach is to use active learning to collect the most useful initial ratings. However, the performance of active learning depends strongly upon having accurate estimates of i) the uncertainty in model parameters and ii) the intrinsic noisiness of the data. To achieve these estimates we propose a heteroskedastic Bayesian model for ordinal matrix factorization. We also present a computationally efficient framework for Bayesian active learning with this type of complex probabilistic model. This algorithm successfully distinguishes between informative and noisy data points. Our model yields state-of-the-art predictive performance and, coupled with our active learning strategy, enables us to gain useful information in the cold-start setting from the very first active sample.

#### Adaptable probabilistic mapping of short reads using position specific scoring matrices

Peter Kerpedjiev, Jes Frellsen, Stinus Lindgreen, Anders Krogh, 2014. (BMC bioinformatics). **DOI**: 10.1186/1471-2105-15-100.

BACKGROUND: Modern DNA sequencing methods produce vast amounts of data that often requires mapping to a reference genome. Most existing programs use the number of mismatches between the read and the genome as a measure of quality. This approach is without a statistical foundation and can for some data types result in many wrongly mapped reads. Here we present a probabilistic mapping method based on position-specific scoring matrices, which can take into account not only the quality scores of the reads but also user-specified models of evolution and data-specific biases. RESULTS: We show how evolution, data-specific biases, and sequencing errors are naturally dealt with probabilistically. Our method achieves better results than Bowtie and BWA on simulated and real ancient and PAR-CLIP reads, as well as on simulated reads from the AT rich organism P. falciparum, when modeling the biases of these data. For simulated Illumina reads, the method has consistently higher sensitivity for both single-end and paired-end data. We also show that our probabilistic approach can limit the problem of random matches from short reads of contamination and that it improves the mapping of real reads from one organism (D. melanogaster) to a related genome (D. simulans). CONCLUSION: The presented work is an implementation of a novel approach to short read mapping where quality scores, prior mismatch probabilities and mapping qualities are handled in a statistically sound manner. The resulting implementation provides not only a tool for biologists working with low quality and/or biased sequencing data but also a demonstration of the feasibility of using a probability based alignment method on real and simulated data sets.

**Comment:** Peter Kerpedjiev and Jes Frellsen contributed equally. Additional resources are available at bwa-pssm.binf.ku.dk

#### Austerity in MCMC Land: Cutting the Metropolis-Hastings Budget

Anoop Korattikara, Yutian Chen, Max Welling, June 2014. (In 31st International Conference on Machine Learning). Beijing, China.

Can we make Bayesian posterior MCMC sampling more efficient when faced with very large datasets? We argue that computing the likelihood for N datapoints in the Metropolis-Hastings (MH) test to reach a single binary decision is computationally inefficient. We introduce an approximate MH rule based on a sequential hypothesis test that allows us to accept or reject samples with high confidence using only a fraction of the data required for the exact MH rule. While this method introduces an asymptotic bias, we show that this bias can be controlled and is more than offset by a decrease in variance due to our ability to draw more samples per unit of time.

**Comment:** supplementary

#### A role for amplitude modulation phase relationships in speech rhythm perception

Victoria Leong, Michael A Stone, Richard E Turner, Usha Goswami, 2014. (Journal of the Acoustical Society of America). Acoustical Society of America.

Prosodic rhythm in speech [the alternation of “Strong” (S) and “weak” (w) syllables] is cued, among others, by slow rates of amplitude modulation (AM) within the speech envelope. However, it is unclear exactly which envelope modulation rates and statistics are the most important for the rhythm percept. Here, the hypothesis that the phase relationship between “Stress” rate (~2 Hz) and “Syllable” rate (~4 Hz) AMs provides a perceptual cue for speech rhythm is tested. In a rhythm judgment task, adult listeners identified AM tone-vocoded nursery rhyme sentences that carried either trochaic (S-w) or iambic patterning (w-S). Manipulation of listeners’ rhythm perception was attempted by parametrically phase-shifting the Stress AM and Syllable AM in the vocoder. It was expected that a 1π radian phase-shift (half a cycle) would reverse the perceived rhythm pattern (i.e., trochaic -> iambic) whereas a 2π radian shift (full cycle) would retain the perceived rhythm pattern (i.e., trochaic -> trochaic). The results confirmed these predictions. Listeners’ judgments of rhythm systematically followed Stress-Syllable AM phase-shifts, but were unaffected by phase-shifts between the Syllable AM and the Sub-beat AM (~14 Hz) in a control condition. It is concluded that the Stress-Syllable AM phase relationship is an envelope-based modulation statistic that supports speech rhythm perception.

#### Automatic Construction and Natural-Language Description of Nonparametric Regression Models

James Robert Lloyd, David Duvenaud, Roger Grosse, Joshua B. Tenenbaum, Zoubin Ghahramani, July 2014. (In Association for the Advancement of Artificial Intelligence (AAAI)).

This paper presents the beginnings of an automatic statistician, focusing on regression problems. Our system explores an open-ended space of statistical models to discover a good explanation of a data set, and then produces a detailed report with figures and natural-language text. Our approach treats unknown regression functions nonparametrically using Gaussian processes, which has two important consequences. First, Gaussian processes can model functions in terms of high-level properties (e.g. smoothness, trends, periodicity, changepoints). Taken together with the compositional structure of our language of models this allows us to automatically describe functions in simple terms. Second, the use of flexible nonparametric models and a rich language for composing them in an open-ended manner also results in state-of-the-art extrapolation performance evaluated over 13 real time series data sets from various domains.

#### Randomized Nonlinear Component Analysis

David Lopez-Paz, Suvrit Sra, Alex J. Smola, Zoubin Ghahramani, Bernhard Schölkopf, 2014. (In ICML). JMLR.org. JMLR Proceedings.

Classical techniques such as Principal Component Analysis (PCA) and Canonical Correlation Analysis (CCA) are ubiquitous in statistics. However, these techniques only reveal linear relationships in data. Although nonlinear variants of PCA and CCA have been proposed, they are computationally prohibitive in the large scale. In a separate strand of recent research, randomized methods have been proposed to construct features that help reveal nonlinear patterns in data. For basic tasks such as regression or classification, random features exhibit little or no loss in performance, while achieving dramatic savings in computational requirements. In this paper we leverage randomness to design scalable new variants of nonlinear PCA and CCA; our ideas also extend to key multivariate analysis tools such as spectral clustering or LDA. We demonstrate our algorithms through experiments on real-world data, on which we compare against the state-of-the-art. Code in R implementing our methods is provided in the Appendix.

#### Classification using log Gaussian Cox processes

Alexander G. D. G. Matthews, Zoubin Ghahramani, 2014. (arXiv preprint arXiv:1405.4141).

McCullagh and Yang (2006) suggest a family of classification algorithms based on Cox processes. We further investigate the log Gaussian variant which has a number of appealing properties. Conditioned on the covariates, the distribution over labels is given by a type of conditional Markov random field. In the supervised case, computation of the predictive probability of a single test point scales linearly with the number of training points and the multiclass generalization is straightforward. We show new links between the supervised method and classical nonparametric methods. We give a detailed analysis of the pairwise graph representable Markov random field, which we use to extend the model to semi-supervised learning problems, and propose an inference method based on graph min-cuts. We give the first experimental analysis on supervised and semi-supervised datasets and show good empirical performance.

#### Comparing lower bounds on the entropy of mixture distributions for use in variational inference

Alexander G. D. G Matthews, James Hensman, Zoubin Ghahramani, December 2014. (In NIPS workshop on Advances in Variational Inference). Montreal, Canada.

#### Nonlinear Modelling and Control using Gaussian Processes

Andrew McHutchon, 2014. University of Cambridge, Department of Engineering, Cambridge, UK.

In many scientific disciplines it is often required to make predictions about how a system will behave or to deduce the correct control values to elicit a particular desired response. Efficiently solving both of these tasks relies on the construction of a model capturing the system’s operation. In the most interesting situations, the model needs to capture strongly nonlinear effects and deal with the presence of uncertainty and noise. Building models for such systems purely based on a theoretical understanding of underlying physical principles can be infeasibly complex and require a large number of simplifying assumptions. An alternative is to use a data-driven approach, which builds a model directly from observations. A powerful and principled approach to doing this is to use a Gaussian Process (GP). In this thesis we start by discussing how GPs can be applied to data sets which have noise affecting their inputs. We present the “Noisy Input GP”, which uses a simple local-linearisation to refer the input noise into heteroscedastic output noise, and compare it to other methods both theoretically and empirically. We show that this technique leads to an effective model for nonlinear functions with input and output noise. We then consider the broad topic of GP state space models for application to dynamical systems. We discuss a very wide variety of approaches for using GPs in state space models, including introducing a new method based on moment-matching, which consistently gave the best performance. We analyse the methods in some detail including providing a systematic comparison between approximate-analytic and particle methods. To our knowledge such a comparison has not been provided before in this area. Finally, we investigate an automatic control learning framework, which uses Gaussian Processes to model a system for which we wish to design a controller.
Controller design for complex systems is a difficult task and thus a framework which allows an automatic design directly from data promises to be extremely useful. We demonstrate that the previously published framework cannot cope with the presence of observation noise but that the introduction of a state space model dramatically improves its performance. This contribution, along with some other suggested improvements, opens the door for this framework to be used in real-world applications.

#### A reversible infinite HMM using normalised random measures

Konstantina Palla, David A. Knowles, Zoubin Ghahramani, June 2014. (In 31st International Conference on Machine Learning). Beijing, China.

We present a nonparametric prior over reversible Markov chains. We use completely random measures, specifically gamma processes, to construct a countably infinite graph with weighted edges. By enforcing symmetry to make the edges undirected we define a prior over random walks on graphs that results in a reversible Markov chain. The resulting prior over infinite transition matrices is closely related to the hierarchical Dirichlet process but enforces reversibility. A reinforcement scheme has recently been proposed with similar properties, but the de Finetti measure is not well characterised. We take the alternative approach of explicitly constructing the mixing measure, which allows more straightforward and efficient inference at the cost of no longer having a closed form predictive distribution. We use our process to construct a reversible infinite HMM which we apply to two real datasets, one from epigenomics and one ion channel recording.

#### The Inverse Regression Topic Model

Maxim Rabinovich, David M. Blei, 2014. (In 31st International Conference on Machine Learning).

Recently, multinomial inverse regression (MNIR) has been proposed as a new model of annotated text based on the influence of metadata and response variables on the distribution of words in a document. While effective, MNIR has no way to exploit structure in the corpus to improve its predictions or facilitate exploratory data analysis. On the other hand, traditional probabilistic topic models (like latent Dirichlet allocation) capture natural heterogeneity in a collection but do not account for external variables. In this paper, we introduce the inverse regression topic model (IRTM), a mixed-membership extension of MNIR that combines the strengths of both methodologies. We present two inference algorithms for the IRTM: an efficient batch estimation algorithm and an online variant, which is suitable for large corpora. We apply these methods to a corpus of 73K Congressional press releases and another of 150K Yelp reviews, demonstrating that the IRTM outperforms both MNIR and supervised topic models on the prediction task. Further, we give examples showing that the IRTM enables systematic discovery of in-topic lexical variation, which is not possible with previous supervised topic models.

#### Probabilistic ODE Solvers with Runge-Kutta Means

Michael Schober, David Duvenaud, Philipp Hennig, June 2014. (arXiv preprint arXiv:1406.2582).

Runge-Kutta methods are the classic family of solvers for ordinary differential equations (ODEs), and the basis for the state-of-the-art. Like most numerical methods, they return point estimates. We construct a family of probabilistic numerical methods that instead return a Gauss-Markov process defining a probability distribution over the ODE solution. In contrast to prior work, we construct this family such that posterior means match the outputs of the Runge-Kutta family exactly, thus inheriting their proven good properties. Remaining degrees of freedom not identified by the match to Runge-Kutta are chosen such that the posterior probability measure fits the observed structure of the ODE. Our results shed light on the structure of Runge-Kutta solvers from a new direction, provide a richer, probabilistic output, have low computational cost, and raise new research questions.

#### Student-t Processes as Alternatives to Gaussian Processes

Amar Shah, Andrew Gordon Wilson, Zoubin Ghahramani, 2014. (In AISTATS). JMLR.org. JMLR Proceedings.

We investigate the Student-t process as an alternative to the Gaussian process as a nonparametric prior over functions. We derive closed form expressions for the marginal likelihood and predictive distribution of a Student-t process, by integrating away an inverse Wishart process prior over the covariance kernel of a Gaussian process model. We show surprising equivalences between different hierarchical Gaussian process models leading to Student-t processes, and derive a new sampling scheme for the inverse Wishart process, which helps elucidate these equivalences. Overall, we show that a Student-t process can retain the attractive properties of a Gaussian process – a nonparametric representation, analytic marginal and predictive distributions, and easy model selection through covariance kernels – but has enhanced flexibility, and predictive covariances that, unlike a Gaussian process, explicitly depend on the values of training observations. We verify empirically that a Student-t process is especially useful in situations where there are changes in covariance structure, or in applications like Bayesian optimization, where accurate predictive covariances are critical for good performance. These advantages come at no additional computational cost over Gaussian processes.

#### Clamping Variables and Approximate Inference

Adrian Weller, Tony Jebara, 2014. (In Advances in Neural Information Processing Systems 27). Edited by Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, K.Q. Weinberger. Curran Associates, Inc.

It was recently proved using graph covers (Ruozzi, 2012) that the Bethe partition function is upper bounded by the true partition function for a binary pairwise model that is attractive. Here we provide a new, arguably simpler proof from first principles. We make use of the idea of clamping a variable to a particular value. For an attractive model, we show that summing over the Bethe partition functions for each sub-model obtained after clamping any variable can only raise (and hence improve) the approximation. In fact, we derive a stronger result that may have other useful implications. Repeatedly clamping until we obtain a model with no cycles, where the Bethe approximation is exact, yields the result. We also provide a related lower bound on a broad class of approximate partition functions of general pairwise multi-label models that depends only on the topology. We demonstrate that clamping a few wisely chosen variables can be of practical value by dramatically reducing approximation error.

**Comment:** Supplementary Material

#### Covariance Kernels for Fast Automatic Pattern Discovery and Extrapolation with Gaussian Processes

Andrew Gordon Wilson, 2014. University of Cambridge, Cambridge, UK.

Truly intelligent systems are capable of pattern discovery and extrapolation without human intervention. Bayesian nonparametric models, which can uniquely represent expressive prior information and detailed inductive biases, provide a distinct opportunity to develop intelligent systems, with applications in essentially any learning and prediction task. Gaussian processes are rich distributions over functions, which provide a Bayesian nonparametric approach to smoothing and interpolation. A covariance kernel determines the support and inductive biases of a Gaussian process. In this thesis, we introduce new covariance kernels to enable fast automatic pattern discovery and extrapolation with Gaussian processes. In the introductory chapter, we discuss the high level principles behind all of the models in this thesis: 1) we can typically improve the predictive performance of a model by accounting for additional structure in data; 2) to automatically discover rich structure in data, a model must have large support and the appropriate inductive biases; 3) we most need expressive models for large datasets, which typically provide more information for learning structure, and 4) we can often exploit the existing inductive biases (assumptions) or structure of a model for scalable inference, without the need for simplifying assumptions. In the context of this introduction, we then discuss, in chapter 2, Gaussian processes as kernel machines, and my views on the future of Gaussian process research. In chapter 3 we introduce the Gaussian process regression network (GPRN) framework, a multi-output Gaussian process method which scales to many output variables, and accounts for input-dependent correlations between the outputs. Underlying the GPRN is a highly expressive kernel, formed using an adaptive mixture of latent basis functions in a neural network like architecture. The GPRN is capable of discovering expressive structure in data. 
We use the GPRN to model the time-varying expression levels of 1000 genes, the spatially varying concentrations of several distinct heavy metals, and multivariate volatility (input dependent noise covariances) between returns on equity indices and currency exchanges, which is particularly valuable for portfolio allocation. We generalise the GPRN to an adaptive network framework, which does not depend on Gaussian processes or Bayesian nonparametrics; and we outline applications for the adaptive network in nuclear magnetic resonance (NMR) spectroscopy, ensemble learning, and change-point modelling. In chapter 4 we introduce simple closed form kernels for automatic pattern discovery and extrapolation. These spectral mixture (SM) kernels are derived by modelling the spectral density of a kernel (its Fourier transform) using a scale-location Gaussian mixture. SM kernels form a basis for all stationary covariances, and can be used as a drop-in replacement for standard kernels, as they retain simple and exact learning and inference procedures. We use the SM kernel to discover patterns and perform long range extrapolation on atmospheric CO2 trends and airline passenger data, as well as on synthetic examples. We also show that the SM kernel can be used to automatically reconstruct several standard covariances. The SM kernel and the GPRN are highly complementary; we show that using the SM kernel with adaptive basis functions in a GPRN induces an expressive prior over non-stationary kernels. In chapter 5 we introduce GPatt, a method for fast multidimensional pattern extrapolation, particularly suited to image and movie data. Without human intervention – no hand crafting of kernel features, and no sophisticated initialisation procedures – we show that GPatt can solve large scale pattern extrapolation, inpainting and kernel discovery problems, including a problem with 383,400 training points.
GPatt exploits the structure of a spectral mixture product (SMP) kernel, for fast yet exact inference procedures. We find that GPatt significantly outperforms popular alternative scalable Gaussian process methods in speed and accuracy. Moreover, we discover profound differences between each of these methods, suggesting expressive kernels, nonparametric representations, and scalable inference which exploits existing model structure are useful in combination for modelling large scale multidimensional patterns. The models in this dissertation have proven to be scalable and with greatly enhanced predictive performance over the alternatives: the extra structure being modelled is an important part of a wide variety of real data – including problems in econometrics, gene expression, geostatistics, nuclear magnetic resonance spectroscopy, ensemble learning, multi-output regression, change point modelling, time series, multivariate volatility, image inpainting, texture extrapolation, video extrapolation, acoustic modelling, and kernel discovery.

#### Bayesian Inference for NMR Spectroscopy with Applications to Chemical Quantification

Andrew Gordon Wilson, Yuting Wu, Daniel J. Holland, Sebastian Nowozin, Mick D. Mantle, Lynn F. Gladden, Andrew Blake, 2014. (arXiv preprint arXiv:1402.3580).

Nuclear magnetic resonance (NMR) spectroscopy exploits the magnetic properties of atomic nuclei to discover the structure, reaction state and chemical environment of molecules. We propose a probabilistic generative model and inference procedures for NMR spectroscopy. Specifically, we use a weighted sum of trigonometric functions undergoing exponential decay to model free induction decay (FID) signals. We discuss the challenges in estimating the components of this general model – amplitudes, phase shifts, frequencies, decay rates, and noise variances – and offer practical solutions. We compare with conventional Fourier transform spectroscopy for estimating the relative concentrations of chemicals in a mixture, using synthetic and experimentally acquired FID signals. We find the proposed model is particularly robust to low signal to noise ratios (SNR), and overlapping peaks in the Fourier transform of the FID, enabling accurate predictions (e.g., 1% error at low SNR) which are not possible with conventional spectroscopy (5% error).

## 2013

#### A Generative Model of Vector Space Semantics

Jacob Andreas, Zoubin Ghahramani, 2013. (ACL 2013).

We present a novel compositional, generative model for vector space representations of meaning. This model reformulates earlier tensor-based approaches to vector space semantics as a top-down process, and provides efficient algorithms for transformation from natural language to vectors and from vectors to natural language. We describe procedures for estimating the parameters of the model from positive examples of similar phrases, and from distributional representations, then use these procedures to obtain similarity judgments for a set of adjective-noun pairs. The model’s estimation of the similarity of these pairs correlates well with human annotations, demonstrating a substantial improvement over several existing compositional approaches in both settings.

#### Bayesian Structured Prediction Using Gaussian Processes

Sébastien Bratières, Novi Quadrianto, Zoubin Ghahramani, 2013. (arXiv).

We introduce a conceptually novel structured prediction model, GPstruct, which is kernelized, non-parametric and Bayesian, by design. We motivate the model with respect to existing approaches, among others, conditional random fields (CRFs), maximum margin Markov networks (M3N), and structured support vector machines (SVMstruct), which embody only a subset of its properties. We present an inference procedure based on Markov Chain Monte Carlo. The framework can be instantiated for a wide range of structured objects such as linear chains, trees, grids, and other general graphs. As a proof of concept, the model is benchmarked on several natural language processing tasks and a video gesture segmentation task involving a linear chain structure. We show prediction accuracies for GPstruct which are comparable to or exceeding those of CRFs and SVMstruct.

#### Structure Discovery in Nonparametric Regression through Compositional Kernel Search

David Duvenaud, James Robert Lloyd, Roger Grosse, Joshua B. Tenenbaum, Zoubin Ghahramani, June 2013. (In 30th International Conference on Machine Learning). Atlanta, Georgia, USA.

Despite its importance, choosing the structural form of the kernel in nonparametric regression remains a black art. We define a space of kernel structures which are built compositionally by adding and multiplying a small number of base kernels. We present a method for searching over this space of structures which mirrors the scientific discovery process. The learned structures can often decompose functions into interpretable components and enable long-range extrapolation on time-series datasets. Our structure search method outperforms many widely used kernels and kernel combination methods on a variety of prediction tasks.

#### Model Reductions for Inference: Generality of Pairwise, Binary, and Planar Factor Graphs

Frederik Eaton, Zoubin Ghahramani, 2013. (Neural Computation).

We offer a solution to the problem of efficiently translating algorithms between different types of discrete statistical model. We investigate the expressive power of three classes of model – those with binary variables, with pairwise factors, and with planar topology – as well as their four intersections. We formalize a notion of “simple reduction” for the problem of inferring marginal probabilities and consider whether it is possible to “simply reduce” marginal inference from general discrete factor graphs to factor graphs in each of these seven subclasses. We characterize the reducibility of each class, showing in particular that the class of binary pairwise factor graphs is able to simply reduce only positive models. We also exhibit a continuous “spectral reduction” based on polynomial interpolation, which overcomes this limitation. Experiments assess the performance of standard approximate inference algorithms on the outputs of our reductions.

#### Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC

Roger Frigola, Fredrik Lindsten, Thomas B. Schön, Carl Edward Rasmussen, 2013. (In Advances in Neural Information Processing Systems 26). Edited by L. Bottou, C.J.C. Burges, Z. Ghahramani, M. Welling, K.Q. Weinberger. Curran Associates, Inc.

State-space models are successfully used in many areas of science, engineering and economics to model time series and dynamical systems. We present a fully Bayesian approach to inference and learning in nonlinear nonparametric state-space models. We place a Gaussian process prior over the transition dynamics, resulting in a flexible model able to capture complex dynamical phenomena. However, to enable efficient inference, we marginalize over the dynamics of the model and instead infer directly the joint smoothing distribution through the use of specially tailored Particle Markov Chain Monte Carlo samplers. Once a sample from the smoothing distribution is computed, the state transition predictive distribution can be formulated analytically. We make use of sparse Gaussian process models to greatly reduce the computational complexity of the approach.

#### Integrated Pre-Processing for Bayesian Nonlinear System Identification with Gaussian Processes

Roger Frigola, Carl Edward Rasmussen, 2013. (In Decision and Control (CDC), 2013 IEEE 52nd Annual Conference on).

We introduce GP-FNARX: a new model for nonlinear system identification based on a nonlinear autoregressive exogenous model (NARX) with filtered regressors (F) where the nonlinear regression problem is tackled using sparse Gaussian processes (GP). We integrate data pre-processing with system identification into a fully automated procedure that goes from raw data to an identified model. Both pre-processing parameters and GP hyper-parameters are tuned by maximizing the marginal likelihood of the probabilistic model. We obtain a Bayesian model of the system’s dynamics which is able to report its uncertainty in regions where the data is scarce. The automated approach, the modeling of uncertainty and its relatively low computational cost make GP-FNARX a good candidate for applications in robotics and adaptive control.

#### A Systematic Bayesian Treatment of the IBM Alignment Models

Yarin Gal, Phil Blunsom, 2013. (In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies). Association for Computational Linguistics.

The dominant yet ageing IBM and HMM word alignment models underpin most popular Statistical Machine Translation implementations in use today. Though beset by the limitations of implausible independence assumptions, intractable optimisation problems, and an excess of tunable parameters, these models provide a scalable and reliable starting point for inducing translation systems. In this paper we build upon this venerable base by recasting these models in the non-parametric Bayesian framework. By replacing the categorical distributions at their core with hierarchical Pitman-Yor processes, and through the use of collapsed Gibbs sampling, we provide a more flexible formulation and sidestep the original heuristic optimisation techniques. The resulting models are highly extendible, naturally permitting the introduction of phrasal dependencies. We present extensive experimental results showing improvements in both AER and BLEU when benchmarked against Giza++, including significant improvements over IBM model 4.

#### Bayesian nonparametrics and the probabilistic approach to modelling

Zoubin Ghahramani, 2013. (Philosophical Transactions of the Royal Society A).

Modelling is fundamental to many fields of science and engineering. A model can be thought of as a representation of possible data one could predict from a system. The probabilistic approach to modelling uses probability theory to express all aspects of uncertainty in the model. The probabilistic approach is synonymous with Bayesian modelling, which simply uses the rules of probability theory in order to make predictions, compare alternative models, and learn model parameters and structure from data. This simple and elegant framework is most powerful when coupled with flexible probabilistic models. Flexibility is achieved through the use of Bayesian nonparametrics. This article provides an overview of probabilistic modelling and an accessible survey of some of the main tools in Bayesian nonparametrics. The survey covers the use of Bayesian nonparametrics for modelling unknown functions, density estimation, clustering, time series modelling, and representing sparsity, hierarchies, and covariance structure. More specifically it gives brief non-technical overviews of Gaussian processes, Dirichlet processes, infinite hidden Markov models, Indian buffet processes, Kingman’s coalescent, Dirichlet diffusion trees, and Wishart processes.

#### Scaling Multidimensional Gaussian Processes using Projected Additive Approximations

E. Gilboa, Yunus Saatçi, John P. Cunningham, 2013. (In 30th International Conference on Machine Learning).

Exact Gaussian Process (GP) regression has O(N^3) runtime for data size N, making it intractable for large N. Many algorithms for improving GP scaling approximate the covariance with lower rank matrices. Other work has exploited structure inherent in particular covariance functions, including GPs with implied Markov structure, and equispaced inputs (both enable O(N) runtime). However, these GP advances have not been extended to the multidimensional input setting, despite the preponderance of multidimensional applications. This paper introduces and tests novel extensions of structured GPs to multidimensional inputs. We present new methods for additive GPs, showing a novel connection between the classic backfitting method and the Bayesian framework. To achieve optimal accuracy-complexity tradeoff, we extend this model with a novel variant of projection pursuit regression. Our primary result – projection pursuit Gaussian Process Regression – shows orders of magnitude speedup while preserving high accuracy. The natural second and third steps include non-Gaussian observations and higher dimensional equispaced grid methods. We introduce novel techniques to address both of these necessary directions. We thoroughly illustrate the power of these three advances on several datasets, achieving close performance to the naive Full GP at orders of magnitude less cost.

#### Learning Feature Selection Dependencies in Multi-task Learning

Daniel Hernández-Lobato, José Miguel Hernández-Lobato, December 2013. (In Advances in Neural Information Processing Systems 27). Lake Tahoe, California, USA.

A probabilistic model based on the horseshoe prior is proposed for learning dependencies in the process of identifying relevant features for prediction. Exact inference is intractable in this model. However, expectation propagation offers an approximate alternative. Because the process of estimating feature selection dependencies may suffer from over-fitting in the model proposed, additional data from a multi-task learning scenario are considered for induction. The same model can be used in this setting with few modifications. Furthermore, the assumptions made are less restrictive than in other multi-task methods: The different tasks must share feature selection dependencies, but can have different relevant features and model coefficients. Experiments with real and synthetic data show that this model performs better than other multi-task alternatives from the literature. The experiments also show that the model is able to induce suitable feature selection dependencies for the problems considered, only from the training data.

#### Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation

Daniel Hernández-Lobato, José Miguel Hernández-Lobato, Pierre Dupont, July 2013. (Journal of Machine Learning Research).

We describe a Bayesian method for group feature selection in linear regression problems. The method is based on a generalized version of the standard spike-and-slab prior distribution which is often used for individual feature selection. Exact Bayesian inference under the prior considered is infeasible for typical regression problems. However, approximate inference can be carried out efficiently using Expectation Propagation (EP). A detailed analysis of the generalized spike-and-slab prior shows that it is well suited for regression problems that are sparse at the group level. Furthermore, this prior can be used to introduce prior knowledge about specific groups of features that are a priori believed to be more relevant. An experimental evaluation compares the performance of the proposed method with those of group LASSO, Bayesian group LASSO, automatic relevance determination and additional variants used for group feature selection. The results of these experiments show that a model based on the generalized spike-and-slab prior and the EP algorithm has state-of-the-art prediction performance in the problems analyzed. Furthermore, this model is also very useful to carry out sequential experimental design (also known as active learning), where the data instances that are most informative are iteratively included in the training set, reducing the number of instances needed to obtain a particular level of prediction accuracy.

#### Stochastic Inference for Scalable Probabilistic Modeling of Binary Matrices

José Miguel Hernández-Lobato, Neil Houlsby, Zoubin Ghahramani, 2013. (In NIPS Workshop on Randomized Methods for Machine Learning).

Fully observed large binary matrices appear in a wide variety of contexts. To model them, probabilistic matrix factorization (PMF) methods are an attractive solution. However, current batch algorithms for PMF can be inefficient since they need to analyze the entire data matrix before producing any parameter updates. We derive an efficient stochastic inference algorithm for PMF models of fully observed binary matrices. Our method exhibits faster convergence rates than more expensive batch approaches and has better predictive performance than scalable alternatives. The proposed method includes new data subsampling strategies which produce large gains over standard uniform subsampling. We also address the task of automatically selecting the size of the minibatches of data and we propose an algorithm that adjusts this hyper-parameter in an online manner.

#### Gaussian Process Conditional Copulas with Applications to Financial Time Series

José Miguel Hernández-Lobato, James Robert Lloyd, Daniel Hernández-Lobato, December 2013. (In Advances in Neural Information Processing Systems 27). Lake Tahoe, California, USA.

The estimation of dependencies between multiple variables is a central problem in the analysis of financial time series. A common approach is to express these dependencies in terms of a copula function. Typically the copula function is assumed to be constant but this may be inaccurate when there are covariates that could have a large influence on the dependence structure of the data. To account for this, a Bayesian framework for the estimation of conditional copulas is proposed. In this framework the parameters of a copula are non-linearly related to some arbitrary conditioning variables. We evaluate the ability of our method to predict time-varying dependencies on several equities and currencies and observe consistent performance gains compared to static copula models and other time-varying copula methods.

#### Statistical Fitting of Undrained Strength Data

Neil Houlsby, Guy Houlsby, 2013. (Geotechnique). Telford. **DOI**: 10.1680/geot.13.P.007.

We describe an approach, based on Bayesian statistical methods, that allows the fitting of a design profile to a set of measurements of undrained strengths. In particular we allow for the automatic determination of not only the positions of boundaries between geological units, but also the selection of the number of units to model the data in an appropriate way.

#### Cognitive tomography reveals complex task-independent mental representations

Neil Houlsby, Ferenc Huszár, Mohammad M Ghassemi, Gergő Orbán, Daniel M Wolpert, Máté Lengyel, 2013. (Current Biology). **DOI**: 10.1016/j.cub.2013.09.012.

Humans develop rich mental representations that guide their behavior in a variety of every-day tasks. However, it is unknown whether these representations, often formalized as priors in Bayesian inference, are specific for each task or subserve multiple tasks. Current approaches cannot distinguish between these two possibilities because they cannot extract comparable representations across different tasks. Here, we develop a novel method, termed cognitive tomography, that can extract complex, multi-dimensional priors across tasks. We apply this method to human judgments in two qualitatively different tasks, familiarity and odd-one-out, involving an ecologically relevant set of stimuli, human faces. We show that priors over faces are structurally complex and vary dramatically across subjects, but are invariant across the tasks within each subject. The priors we extract from each task allow us to predict with high precision the behavior of subjects for novel stimuli both in the same task as well as in the other task. Our results provide the first evidence for a single high-dimensional structured representation of a naturalistic stimulus set that guides behavior in multiple tasks. Moreover, the representations estimated by cognitive tomography can provide independent, behavior-based regressors for elucidating the neural correlates of complex naturalistic priors.

#### Warped Mixtures for Nonparametric Cluster Shapes

Tomoharu Iwata, David Duvenaud, Zoubin Ghahramani, July 2013. (In 29th Conference on Uncertainty in Artificial Intelligence). Bellevue, Washington.

A mixture of Gaussians fit to a single curved or heavy-tailed cluster will report that the data contains many clusters. To produce more appropriate clusterings, we introduce a model which warps a latent mixture of Gaussians to produce nonparametric cluster shapes. The possibly low-dimensional latent mixture model allows us to summarize the properties of the high-dimensional clusters (or density manifolds) describing the data. The number of manifolds, as well as the shape and dimension of each manifold is automatically inferred. We derive a simple inference scheme for this model which analytically integrates out both the mixture parameters and the warping function. We show that our model is effective for density estimation, performs better than infinite Gaussian mixture models at recovering the true number of clusters, and produces interpretable summaries of high-dimensional datasets.

#### Active Learning for Interactive Visualization

Tomoharu Iwata, Neil Houlsby, Zoubin Ghahramani, 2013. (In 16th International Conference on Artificial Intelligence and Statistics).

Many automatic visualization methods have been proposed. However, a visualization that is automatically generated might be different to how a user wants to arrange the objects in visualization space. By allowing users to re-locate objects in the embedding space of the visualization, they can adjust the visualization to their preference. We propose an active learning framework for interactive visualization which selects objects for the user to re-locate so that they can obtain their desired visualization by re-locating as few as possible. The framework is based on an information theoretic criterion, which favors objects that reduce the uncertainty of the visualization. We present a concrete application of the proposed framework to the Laplacian eigenmap visualization method. We demonstrate experimentally that the proposed framework yields the desired visualization with fewer user interactions than existing methods.

#### Experimental Adaptive Bayesian Tomography

Konstantin Kravtsov, Stanislav Straupe, Igor Radchenko, Neil Houlsby, Ferenc Huszár, Sergey Kulik, 2013. (Physical Review A). APS.

We report an experimental realization of an adaptive quantum state tomography protocol. Our method takes advantage of a Bayesian approach to statistical inference and is naturally tailored for adaptive strategies. For pure states we observe close to N^(-1) scaling of infidelity with overall number of registered events, while best non-adaptive protocols allow for N^(-1/2) scaling only. Experiments are performed for polarization qubits, but the approach is readily adapted to any dimension.

#### SiGMa: simple greedy matching for aligning large knowledge bases

Simon Lacoste-Julien, Konstantina Palla, Alex Davies, Gjergji Kasneci, Thore Graepel, Zoubin Ghahramani, 2013. (In KDD). Association for Computing Machinery. **ISBN**: 978-1-4503-2174-7.

The Internet has enabled the creation of a growing number of large-scale knowledge bases in a variety of domains containing complementary information. Tools for automatically aligning these knowledge bases would make it possible to unify many sources of structured knowledge and answer complex queries. However, the efficient alignment of large-scale knowledge bases still poses a considerable challenge. Here, we present Simple Greedy Matching (SiGMa), a simple algorithm for aligning knowledge bases with millions of entities and facts. SiGMa is an iterative propagation algorithm which leverages both the structural information from the relationship graph as well as flexible similarity measures between entity properties in a greedy local search, thus making it scalable. Despite its greedy nature, our experiments indicate that SiGMa can efficiently match some of the world’s largest knowledge bases with high precision. We provide additional experiments on benchmark datasets which demonstrate that SiGMa can outperform state-of-the-art approaches both in accuracy and efficiency.

#### GEFCom2012 Hierarchical Load Forecasting: Gradient Boosting Machines and Gaussian Processes

James Robert Lloyd, 2013. (International Journal of Forecasting).

This report discusses methods for forecasting hourly loads of a US utility as part of the load forecasting track of the Global Energy Forecasting Competition 2012 hosted on Kaggle. The methods described (gradient boosting machines and Gaussian processes) are generic machine learning / regression algorithms and few domain specific adjustments were made. Despite this, the algorithms were able to produce highly competitive predictions and hopefully they can inspire more refined techniques to compete with state-of-the-art load forecasting methodologies.

#### The Randomized Dependence Coefficient

David Lopez-Paz, Philipp Hennig, Bernhard Schölkopf, December 2013. (In Advances in Neural Information Processing Systems 27). Lake Tahoe, California, USA.

We introduce the Randomized Dependence Coefficient (RDC), a measure of non-linear dependence between random variables of arbitrary dimension based on the Hirschfeld-Gebelein-Rényi Maximum Correlation Coefficient. RDC is defined in terms of correlation of random non-linear copula projections; it is invariant with respect to marginal distribution transformations, has low computational cost and is easy to implement: just five lines of R code, included at the end of the paper.
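
The recipe in this abstract (copula transform, random non-linear projections, then a correlation) can be illustrated with a minimal NumPy sketch. This is not the authors' R listing: the helper names, the defaults `k=20` and `s=1/6`, the sine nonlinearity, and the SVD-based rank cutoff `tol` are all assumptions made for illustration, and ties in the data are not handled.

```python
import numpy as np


def _copula_with_bias(x):
    """Map each column to empirical CDF values (rank / n) and append a column of ones."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    n = x.shape[0]
    # Ordinal ranks 1..n per column; ties are broken by position, which is
    # only adequate for continuous data (an assumption of this sketch).
    ranks = np.argsort(np.argsort(x, axis=0), axis=0) + 1
    return np.hstack([ranks / n, np.ones((n, 1))])


def _random_sine_features(u, k, s, rng):
    """Project onto k random Gaussian directions and apply a sine nonlinearity."""
    w = rng.standard_normal((u.shape[1], k))
    return np.sin((s / u.shape[1]) * (u @ w))


def _orthonormal_basis(a, tol=1e-10):
    """Orthonormal basis of the centred column space, dropping negligible directions."""
    a = a - a.mean(axis=0)
    u, sv, _ = np.linalg.svd(a, full_matrices=False)
    return u[:, sv > tol * sv[0]]


def rdc(x, y, k=20, s=1.0 / 6.0, seed=0):
    """Illustrative randomized dependence estimate in [0, 1] (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    fx = _random_sine_features(_copula_with_bias(x), k, s, rng)
    fy = _random_sine_features(_copula_with_bias(y), k, s, rng)
    qx, qy = _orthonormal_basis(fx), _orthonormal_basis(fy)
    # The largest canonical correlation equals the largest singular value of qx^T qy.
    rho = np.linalg.svd(qx.T @ qy, compute_uv=False)[0]
    return float(min(rho, 1.0))
```

Because the inputs are replaced by their ranks before projection, the estimate is unchanged by strictly increasing transformations of either variable, which is the marginal invariance the abstract refers to. A quadratic relationship such as `y = x**2` with `x` symmetric about zero is a typical case where Pearson correlation is near zero but this kind of estimate is high.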

#### Gaussian Process Vine Copulas for Multivariate Dependence

David Lopez-Paz, José Miguel Hernández-Lobato, Zoubin Ghahramani, June 2013. (In 30th International Conference on Machine Learning). Atlanta, Georgia, USA.

Copulas allow one to learn marginal distributions separately from the multivariate dependence structure (copula) that links them together into a density function. Vine factorizations ease the learning of high-dimensional copulas by constructing a hierarchy of conditional bivariate copulas. However, to simplify inference, it is common to assume that each of these conditional bivariate copulas is independent from its conditioning variables. In this paper, we relax this assumption by discovering the latent functions that specify the shape of a conditional copula given its conditioning variables. We learn these functions by following a Bayesian approach based on sparse Gaussian processes with expectation propagation for scalable, approximate inference. Experiments on real-world datasets show that, when modeling all conditional dependencies, we obtain better estimates of the underlying copula of the data.

#### On the Accuracy of Short Read Mapping

Peter Menzel, Jes Frellsen, Mireya Plass, Simon H. Rasmussen, Anders Krogh, 2013. (In Deep Sequencing Data Analysis). Springer. **DOI**: 10.1007/978-1-62703-514-9_3.

The development of high-throughput sequencing technologies has revolutionized the way we study genomes and gene regulation. In a single experiment, millions of reads are produced. To gain knowledge from these experiments the first thing to be done is finding the genomic origin of the reads, i.e., mapping the reads to a reference genome. In this new situation, conventional alignment tools are obsolete, as they cannot handle this huge amount of data in a reasonable amount of time. Thus, new mapping algorithms have been developed, which are fast at the expense of a small decrease in accuracy. In this chapter we discuss the current problems in short read mapping and show that mapping reads correctly is a nontrivial task. Through simple experiments with both real and synthetic data, we demonstrate that different mappers can give different results depending on the type of data, and that a considerable fraction of uniquely mapped reads is potentially mapped to an incorrect location. Furthermore, we provide simple statistical results on the expected number of random matches in a genome (E-value) and the probability of a random match as a function of read length. Finally, we show that quality scores contain valuable information for mapping and why mapping quality should be evaluated in a probabilistic manner. In the end, we discuss the potential of improving the performance of current methods by considering these quality scores in a probabilistic mapping program.

**Comment:** Peter Menzel and Jes Frellsen contributed equally.

#### The Supervised IBP: Neighbourhood Preserving Infinite Latent Feature Models

Novi Quadrianto, Viktoriia Sharmanska, David A. Knowles, Zoubin Ghahramani, July 2013. (In 29th Conference on Uncertainty in Artificial Intelligence). Bellevue, USA.

We propose a probabilistic model to infer supervised latent variables in the Hamming space from observed data. Our model allows simultaneous inference of the number of binary latent variables, and their values. The latent variables preserve neighbourhood structure of the data in the sense that objects in the same semantic concept have similar latent values, and objects in different concepts have dissimilar latent values. We formulate the supervised infinite latent variable problem based on an intuitive principle of pulling objects together if they are of the same type, and pushing them apart if they are not. We then combine this principle with a flexible Indian Buffet Process prior on the latent variables. We show that the inferred supervised latent variables can be directly used to perform a nearest neighbour search for the purpose of retrieval. We introduce a new application of dynamically extending hash codes, and show how to effectively couple the structure of the hash codes with the continuously growing structure of the neighbourhood preserving infinite latent feature space.

#### Scaling the Indian Buffet Process via Submodular Maximization

Colorado Reed, Zoubin Ghahramani, 2013. (In ICML). JMLR.org. JMLR Proceedings.

Inference for latent feature models is inherently difficult as the inference space grows exponentially with the size of the input data and number of latent features. In this work, we use Kurihara & Welling (2008)’s maximization-expectation framework to perform approximate MAP inference for linear-Gaussian latent feature models with an Indian Buffet Process (IBP) prior. This formulation yields a submodular function of the features that corresponds to a lower bound on the model evidence. By adding a constant to this function, we obtain a nonnegative submodular function that can be maximized via a greedy algorithm that obtains at least a one-third approximation to the optimal solution. Our inference method scales linearly with the size of the input data, and we show the efficacy of our method on the largest datasets currently analyzed using an IBP model.

#### Determinantal Clustering Processes - A Nonparametric Bayesian Approach to Kernel Based Semi-Supervised Clustering

Amar Shah, Zoubin Ghahramani, 2013. (UAI).

Semi-supervised clustering is the task of clustering data points into clusters where only a fraction of the points are labelled. The true number of clusters in the data is often unknown and most models require this parameter as an input. Dirichlet process mixture models are appealing as they can infer the number of clusters from the data. However, these models do not deal with high dimensional data well and can encounter difficulties in inference. We present a novel nonparametric Bayesian kernel based method to cluster data points without the need to prespecify the number of clusters or to model complicated densities from which data points are assumed to be generated. The key insight is to use determinants of submatrices of a kernel matrix as a measure of how close together a set of points are. We explore some theoretical properties of the model and derive a natural Gibbs based algorithm with MCMC hyperparameter learning. The model is implemented on a variety of synthetic and real world data sets.

#### Learning to Rank Using Privileged Information

Viktoriia Sharmanska, Novi Quadrianto, Christoph Lampert, 2013. (In International Conference on Computer Vision).

Many computer vision problems have an asymmetric distribution of information between training and test time. In this work, we study the case where we are given additional information about the training data, which however will not be available at test time. This situation is called learning using privileged information (LUPI). We introduce two maximum-margin techniques that are able to make use of this additional source of information, and we show that the framework is applicable to several scenarios that have been studied in computer vision before. Experiments with attributes, bounding boxes, image tags and rationales as additional information in object classification show promising results.

#### Gaussian Process Kernels for Pattern Discovery and Extrapolation

Andrew Gordon Wilson, Ryan Prescott Adams, February 18 2013. (In 30th International Conference on Machine Learning).

Gaussian processes are rich distributions over functions, which provide a Bayesian nonparametric approach to smoothing and interpolation. We introduce simple closed form kernels that can be used with Gaussian processes to discover patterns and enable extrapolation. These kernels are derived by modelling a spectral density – the Fourier transform of a kernel – with a Gaussian mixture. The proposed kernels support a broad class of stationary covariances, but Gaussian process inference remains simple and analytic. We demonstrate the proposed kernels by discovering patterns and performing long range extrapolation on synthetic examples, as well as atmospheric CO2 trends and airline passenger data. We also show that we can reconstruct standard covariances within our framework.
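
To make the construction in this abstract concrete: placing a symmetrised Gaussian mixture on the spectral density of a one-dimensional stationary kernel and taking the inverse Fourier transform gives k(τ) = Σ_q w_q exp(-2π² τ² v_q) cos(2π τ μ_q). The sketch below evaluates that one-dimensional form with NumPy. The function and argument names are hypothetical, and the hyperparameter values in the usage lines are arbitrary placeholders rather than values from the paper.

```python
import numpy as np


def spectral_mixture_kernel(tau, weights, means, variances):
    """One-dimensional spectral mixture kernel evaluated at lags tau.

    weights, means and variances are the mixture weights w_q, spectral
    means mu_q and spectral variances v_q (one entry per component).
    """
    tau = np.asarray(tau, dtype=float)[..., None]  # trailing axis indexes components
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    envelope = np.exp(-2.0 * np.pi ** 2 * tau ** 2 * variances)
    oscillation = np.cos(2.0 * np.pi * tau * means)
    return np.sum(weights * envelope * oscillation, axis=-1)


# Usage: Gram matrix on a grid with two components (placeholder hyperparameters).
x = np.linspace(0.0, 5.0, 50)
K = spectral_mixture_kernel(x[:, None] - x[None, :],
                            weights=[1.0, 0.5],
                            means=[0.0, 1.0],
                            variances=[0.05, 0.01])
```

With a single component whose spectral mean is zero and whose spectral variance is 1/(4π²ℓ²), the expression reduces to the squared exponential kernel exp(-τ²/(2ℓ²)), which is one instance of the abstract's remark that standard covariances can be reconstructed within the framework.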

**Comment:** arXiv:1302.4245

#### GPatt: Fast Multidimensional Pattern Extrapolation with Gaussian Processes

Andrew Gordon Wilson, Elad Gilboa, Arye Nehorai, John P Cunningham, 2013. (arXiv preprint arXiv:1310.5288).

Gaussian processes are typically used for smoothing and interpolation on small datasets. We introduce a new Bayesian nonparametric framework – GPatt – enabling automatic pattern extrapolation with Gaussian processes on large multidimensional datasets. GPatt unifies and extends highly expressive kernels and fast exact inference techniques. Without human intervention – no hand crafting of kernel features, and no sophisticated initialisation procedures – we show that GPatt can solve large scale pattern extrapolation, inpainting, and kernel discovery problems, including a problem with 383,400 training points. We find that GPatt significantly outperforms popular alternative scalable Gaussian process methods in speed and accuracy. Moreover, we discover profound differences between each of these methods, suggesting expressive kernels, nonparametric representations, and scalable inference which exploits model structure are useful in combination for modelling large scale multidimensional patterns.

#### Dynamic Covariance Models for Multivariate Financial Time Series

Yue Wu, José Miguel Hernández-Lobato, Zoubin Ghahramani, June 2013. (In 30th International Conference on Machine Learning). Atlanta, Georgia, USA.

The accurate prediction of time-changing covariances is an important problem in the modeling of multivariate financial data. However, some of the most popular models suffer from a) overfitting problems and multiple local optima, b) failure to capture shifts in market conditions and c) large computational costs. To address these problems we introduce a novel dynamic model for time-changing covariances. Over-fitting and local optima are avoided by following a Bayesian approach instead of computing point estimates. Changes in market conditions are captured by assuming a diffusion process in parameter values, and finally computationally efficient and scalable inference is performed using particle filters. Experiments with financial data show excellent performance of the proposed method with respect to current standard models.

#### Transcriptional data: a new gateway to drug repositioning?

Francesco Iorio, Timothy Rittman, Hong Ge, Michael Menden, Julio Saez-Rodriguez, April 2013. (Drug Discovery Today).

## 2012

#### The dynamic beamformer

A. Bahramisharif, M. A. J. van Gerven, J-M. Schoffelen, Z. Ghahramani, T. Heskes, 2012. (In Machine Learning in Interpretation of Neuroimaging (MLINI) 2011 LNAI 7263). Edited by G. Langs et al.

Beamforming is one of the most commonly used methods for estimating the active neural sources from the MEG or EEG sensor readings. The basic assumption in beamforming is that the sources are uncorrelated, which allows for estimating each source independent of the others. In this paper, we incorporate the independence assumption of the standard beamformer in a linear dynamical system, thereby introducing the dynamic beamformer. Using empirical data, we show that the dynamic beamformer outperforms the standard beamformer in predicting the condition of interest which strongly suggests that it also outperforms the standard method in localizing the active neural generators.

#### Gaussian Processes for time-marked time-series data

John P. Cunningham, Zoubin Ghahramani, Carl Edward Rasmussen, 2012. (In 15th International Conference on Artificial Intelligence and Statistics).

In many settings, data is collected as multiple time series, where each recorded time series is an observation of some underlying dynamical process of interest. These observations are often time-marked with known event times, and one desires to do a range of standard analyses. When there is only one time marker, one simply aligns the observations temporally on that marker. When multiple time-markers are present and are at different times on different time series observations, these analyses are more difficult. We describe a Gaussian Process model for analyzing multiple time series with multiple time markings, and we test it on a variety of data.

#### Robust Filtering and Smoothing with Gaussian Processes

Marc Peter Deisenroth, Ryan D. Turner, Marco F. Huber, Uwe D. Hanebeck, Carl Edward Rasmussen, 2012. (IEEE Transactions on Automatic Control). **DOI**: 10.1109/TAC.2011.2179426.

We propose a principled algorithm for robust Bayesian filtering and smoothing in nonlinear stochastic dynamic systems when both the transition function and the measurement function are described by nonparametric Gaussian process (GP) models. GPs are gaining increasing importance in signal processing, machine learning, robotics, and control for representing unknown system functions by posterior probability distributions. This modern way of “system identification” is more robust than finding point estimates of a parametric function representation. Our principled filtering/smoothing approach for GP dynamic systems is based on analytic moment matching in the context of the forward-backward algorithm. Our numerical evaluations demonstrate the robustness of the proposed approach in situations where other state-of-the-art Gaussian filters and smoothers can fail.

#### Modelling and Control of Nonlinear Systems using Gaussian Processes with Partial Model Information

Joseph Hall, Carl Edward Rasmussen, Jan Maciejowski, 2012. (In 51st IEEE Conference on Decision and Control).

Gaussian processes are gaining increasing popularity among the control community, in particular for the modelling of discrete time state space systems. However, it has not been clear how to incorporate model information, in the form of known state relationships, when using a Gaussian process as a predictive model. An obvious example of known prior information is position and velocity related states. Incorporation of such information would be beneficial both computationally and for faster dynamics learning. This paper introduces a method of achieving this, yielding faster dynamics learning and a reduction in computational effort from O(Dn^2) to O((D-F)n^2) in the prediction stage for a system with D states, F known state relationships and n observations. The effectiveness of the method is demonstrated through its inclusion in the PILCO learning algorithm with application to the swing-up and balance of a torque-limited pendulum and the balancing of a robotic unicycle in simulation.

#### Collaborative Gaussian Processes for Preference Learning

Neil Houlsby, José Miguel Hernández-Lobato, Ferenc Huszár, Zoubin Ghahramani, 2012. (In Advances in Neural Information Processing Systems 26). Curran Associates, Inc.

We present a new model based on Gaussian processes (GPs) for learning pairwise preferences expressed by multiple users. Inference is simplified by using a *preference kernel* for GPs which allows us to combine supervised GP learning of user preferences with unsupervised dimensionality reduction for multi-user systems. The model not only exploits collaborative information from the shared structure in user behavior, but may also incorporate user features if they are available. Approximate inference is implemented using a combination of expectation propagation and variational Bayes. Finally, we present an efficient active learning strategy for querying preferences. The proposed technique performs favorably on real-world data against state-of-the-art multi-user preference learning algorithms.

#### Optimally-Weighted Herding is Bayesian Quadrature

Ferenc Huszár, David Duvenaud, July 2012. (In 28th Conference on Uncertainty in Artificial Intelligence). Catalina Island, California.

Abstract▼ URL

Herding and kernel herding are deterministic methods of choosing samples which summarise a probability distribution. A related task is choosing samples for estimating integrals using Bayesian quadrature. We show that the criterion minimised when selecting samples in kernel herding is equivalent to the posterior variance in Bayesian quadrature. We then show that sequential Bayesian quadrature can be viewed as a weighted version of kernel herding which achieves performance superior to any other weighted herding method. We demonstrate empirically a rate of convergence faster than O(1/N). Our results also imply an upper bound on the empirical error of the Bayesian quadrature estimate.

#### Adaptive Bayesian Quantum Tomography

Ferenc Huszár, Neil Houlsby, 2012. (Physical Review A). APS.

Abstract▼ URL

In this paper we revisit the problem of optimal design of quantum tomographic experiments. In contrast to previous approaches where an optimal set of measurements is decided in advance of the experiment, we allow for measurements to be adaptively and efficiently re-optimised depending on data collected so far. We develop an adaptive statistical framework based on Bayesian inference and Shannon’s information, and demonstrate a ten-fold reduction in the total number of measurements required as compared to non-adaptive methods, including mutually unbiased bases.

#### Bayesian Classifier Combination

Hyun-Chul Kim, Zoubin Ghahramani, 2012. (In 15th International Conference on Artificial Intelligence and Statistics).

Abstract▼ URL

Bayesian model averaging linearly mixes the probabilistic predictions of multiple models, each weighted by its posterior probability. This is the coherent Bayesian way of combining multiple models only under certain restrictive assumptions, which we outline. We explore a general framework for Bayesian model combination (which differs from model averaging) in the context of classification. This framework explicitly models the relationship between each model’s output and the unknown true label. The framework does not require that the models be probabilistic (they can even be human assessors), that they share prior information or receive the same training data, or that they be independent in their errors. Finally, the Bayesian combiner does not need to believe any of the models is in fact correct. We test several variants of this classifier combination procedure starting from a classic statistical model proposed by Dawid and Skene (1979) and using MCMC to add more complex but important features to the model. Comparisons on several data sets to simpler methods like majority voting show that the Bayesian methods not only perform well but result in interpretable diagnostics on the data points and the models.

#### Random function priors for exchangeable arrays with applications to graphs and relational data

James Robert Lloyd, Peter Orbanz, Zoubin Ghahramani, Daniel M. Roy, December 2012. (In Advances in Neural Information Processing Systems 25). Lake Tahoe, California, USA.

Abstract▼ URL

A fundamental problem in the analysis of structured relational data like graphs, networks, databases, and matrices is to extract a summary of the common structure underlying relations between individual entities. Relational data are typically encoded in the form of arrays; invariance to the ordering of rows and columns corresponds to exchangeable arrays. Results in probability theory due to Aldous, Hoover and Kallenberg show that exchangeable arrays can be represented in terms of a random measurable function which constitutes the natural model parameter in a Bayesian model. We obtain a flexible yet simple Bayesian nonparametric model by placing a Gaussian process prior on the parameter function. Efficient inference utilises elliptical slice sampling combined with a random sparse approximation to the Gaussian process. We demonstrate applications of the model to network data and clarify its relation to models in the literature, several of which emerge as special cases.

#### Semi-Supervised Domain Adaptation with Non-Parametric Copulas

David Lopez-Paz, José Miguel Hernández-Lobato, Bernhard Schölkopf, December 2012. (In Advances in Neural Information Processing Systems 25). Lake Tahoe, California, USA.

Abstract▼ URL

A new framework based on the theory of copulas is proposed to address semi-supervised domain adaptation problems. The presented method factorizes any multivariate density into a product of marginal distributions and bivariate copula functions. Therefore, changes in each of these factors can be detected and corrected to adapt a density model across different learning domains. Importantly, we introduce a novel vine copula model, which allows for this factorization in a non-parametric manner. Experimental results on regression problems with real-world data illustrate the efficacy of the proposed approach when compared to state-of-the-art techniques.

#### Bayesian and L1 Approaches for Sparse Unsupervised Learning

Shakir Mohamed, Katherine A. Heller, Zoubin Ghahramani, 2012. (In 29th International Conference on Machine Learning).

Abstract▼ URL

The use of L1 regularisation for sparse learning has generated immense research interest, with many successful applications in diverse areas such as signal acquisition, image coding, genomics and collaborative filtering. While existing work highlights the many advantages of L1 methods, in this paper we find that L1 regularisation often dramatically under-performs in terms of predictive performance when compared to other methods for inferring sparsity. We focus on unsupervised latent variable models, and develop L1 minimising factor models, Bayesian variants of “L1”, and Bayesian models with a stronger L0-like sparsity induced through spike-and-slab distributions. These spike-and-slab Bayesian factor models encourage sparsity while accounting for uncertainty in a principled manner, and avoid unnecessary shrinkage of non-zero values. We demonstrate on a number of data sets that in practice spike-and-slab Bayesian methods out-perform L1 minimisation, even on a computational budget. We thus highlight the need to re-assess the wide use of L1 methods in sparsity-reliant applications, particularly when we care about generalising to previously unseen data, and provide an alternative that, over many varying conditions, provides improved generalisation performance.

#### A Nonparametric Bayesian Model for Multiple Clustering with Overlapping Feature Views

Donglin Niu, Jennifer G. Dy, Z. Ghahramani, 2012. (In 15th International Conference on Artificial Intelligence and Statistics).

Abstract▼ URL

Most clustering algorithms produce a single clustering solution. This is inadequate for many data sets that are multi-faceted and can be grouped and interpreted in many different ways. Moreover, for high-dimensional data, different features may be relevant or irrelevant to each clustering solution, suggesting the need for feature selection in clustering. Features relevant to one clustering interpretation may be different from the ones relevant for an alternative interpretation or view of the data. In this paper, we introduce a probabilistic nonparametric Bayesian model that can discover multiple clustering solutions from data and the feature subsets that are relevant for the clusters in each view. In our model, the features in different views may be shared and therefore the sets of relevant features are allowed to overlap. We model feature relevance to each view using an Indian Buffet Process and the cluster membership in each view using a Chinese Restaurant Process. We provide an inference approach to learn the latent parameters corresponding to this multiple partitioning problem. Our model not only learns the features and clusters in each view but also automatically learns the number of clusters, number of views and number of features in each view.

#### Active Learning of Model Evidence Using Bayesian Quadrature

Michael A. Osborne, David Duvenaud, Roman Garnett, Carl Edward Rasmussen, Stephen J. Roberts, Zoubin Ghahramani, December 2012. (In Advances in Neural Information Processing Systems 25). Lake Tahoe, California, USA.

Abstract▼ URL

Numerical integration is a key component of many problems in scientific computing, statistical modelling, and machine learning. Bayesian Quadrature is a model-based method for numerical integration which, relative to standard Monte Carlo methods, offers increased sample efficiency and a more robust estimate of the uncertainty in the estimated integral. We propose a novel Bayesian Quadrature approach for numerical integration when the integrand is non-negative, such as the case of computing the marginal likelihood, predictive distribution, or normalising constant of a probabilistic model. Our approach approximately marginalises the quadrature model’s hyperparameters in closed form, and introduces an active learning scheme to optimally select function evaluations, as opposed to using Monte Carlo samples. We demonstrate our method on both a number of synthetic benchmarks and a real scientific problem from astronomy.

#### An Infinite Latent Attribute Model for Network Data

Konstantina Palla, David A. Knowles, Zoubin Ghahramani, June 2012. (In 29th International Conference on Machine Learning). Edinburgh, Scotland.

Abstract▼ URL

Latent variable models for network data extract a summary of the relational structure underlying an observed network. The simplest possible models subdivide nodes of the network into clusters; the probability of a link between any two nodes then depends only on their cluster assignment. Currently available models can be classified by whether clusters are disjoint or are allowed to overlap. These models can explain a “flat” clustering structure. Hierarchical Bayesian models provide a natural approach to capture more complex dependencies. We propose a model in which objects are characterised by a latent feature vector. Each feature is itself partitioned into disjoint groups (subclusters), corresponding to a second layer of hierarchy. In experimental comparisons, the model achieves significantly improved predictive performance on social and biological link prediction tasks. The results indicate that models with a single layer hierarchy over-simplify real networks.

#### A nonparametric variable clustering model

Konstantina Palla, David A. Knowles, Zoubin Ghahramani, December 2012. (In Advances in Neural Information Processing Systems 25). Lake Tahoe, California, USA.

Abstract▼ URL

Factor analysis models effectively summarise the covariance structure of high dimensional data, but the solutions are typically hard to interpret. This motivates attempting to find a disjoint partition, i.e. a simple clustering, of observed variables into highly correlated subsets. We introduce a Bayesian non-parametric approach to this problem, and demonstrate advantages over heuristic methods proposed to date. Our Dirichlet process variable clustering (DPVC) model can discover block-diagonal covariance structures in data. We evaluate our method on both synthetic and gene expression analysis problems.

#### Copula-based Kernel Dependency Measures

Barnabas Poczos, Zoubin Ghahramani, Jeff Schneider, 2012. (In 29th International Conference on Machine Learning).

Abstract▼ URL

The paper presents a new copula based method for measuring dependence between random variables. Our approach extends the Maximum Mean Discrepancy to the copula of the joint distribution. We prove that this approach has several advantageous properties. Similarly to Shannon mutual information, the proposed dependence measure is invariant to any strictly increasing transformation of the marginal variables. This is important in many applications, for example in feature selection. The estimator is consistent, robust to outliers, and uses rank statistics only. We derive upper bounds on the convergence rate and propose independence tests too. We illustrate the theoretical contributions through a series of experiments in feature selection and low-dimensional embedding of distributions.

#### The Most Persistent Soft-Clique in a Set of Sampled Graphs

Novi Quadrianto, Chao Chen, Christoph Lampert, June 2012. (In 29th International Conference on Machine Learning). Edinburgh, Scotland.

Abstract▼ URL

When searching for characteristic subpatterns in potentially noisy graph data, it appears self-evident that having multiple observations would be better than having just one. However, it turns out that the inconsistencies introduced when different graph instances have different edge sets pose a serious challenge. In this work we address this challenge for the problem of finding maximum weighted cliques. We introduce the concept of the most persistent soft-clique. This is a subset of vertices that 1) is almost fully or at least densely connected, 2) occurs in all or almost all graph instances, and 3) has the maximum weight. We present a measure of clique-ness that essentially counts the number of edges missing to make a subset of vertices into a clique. With this measure, we show that the problem of finding the most persistent soft-clique can be cast either as: a) a max-min two person game optimization problem, or b) a min-min soft margin optimization problem. Both formulations lead to the same solution when using a partial Lagrangian method to solve the optimization problems. By experiments on synthetic data and on real social network data we show that the proposed method is able to reliably find soft cliques in graph data, even if it is distorted by random noise or unreliable observations.

#### Kernel adaptive Metropolis-Hastings

Dino Sejdinovic, Heiko Strathmann, Maria Lomeli, Christophe Andrieu, Arthur Gretton, June 2012. (In 31st International Conference on Machine Learning). Beijing, China.

Abstract▼ URL

A Kernel Adaptive Metropolis-Hastings algorithm is introduced, for the purpose of sampling from a target distribution with strongly nonlinear support. The algorithm embeds the trajectory of the Markov chain into a reproducing kernel Hilbert space (RKHS), such that the feature space covariance of the samples informs the choice of proposal. The procedure is computationally efficient and straightforward to implement, since the RKHS moves can be integrated out analytically: our proposal distribution in the original space is a normal distribution whose mean and covariance depend on where the current sample lies in the support of the target distribution, and adapts to its local covariance structure. Furthermore, the procedure requires neither gradients nor any other higher order information about the target, making it particularly attractive for contexts such as Pseudo-Marginal MCMC. Kernel Adaptive Metropolis-Hastings outperforms competing fixed and adaptive samplers on multivariate, highly nonlinear target distributions, arising in both real-world and synthetic examples.

#### Augmented Attributes Representations

Viktoriia Sharmanska, Novi Quadrianto, Christoph Lampert, 2012. (In 12th European Conference on Computer Vision).

Abstract▼ URL

We propose a new learning method to infer a mid-level feature representation that combines the advantage of semantic attribute representations with the higher expressive power of non-semantic features. The idea lies in augmenting an existing attribute-based representation with additional dimensions for which an autoencoder model is coupled with a large-margin principle. This construction allows a smooth transition between the zero-shot regime with no training example, the unsupervised regime with training examples but without class labels, and the supervised regime with training examples and with class labels. The resulting optimization problem can be solved efficiently, because several of the necessary steps have closed-form solutions. Through extensive experiments we show that the augmented representation achieves better results in terms of object categorization accuracy than the semantic representation alone.

#### Robust estimation of local genetic ancestry in admixed populations using a non-parametric Bayesian approach

Kyung-Ah Sohn, Zoubin Ghahramani, Eric P. Xing, 2012. (Genetics).

Abstract▼ URL

We present a new haplotype-based approach for inferring local genetic ancestry of individuals in an admixed population. Most existing approaches for local ancestry estimation ignore the latent genetic relatedness between ancestral populations and treat them as independent. In this paper, we exploit such information by building an inheritance model that describes both the ancestral populations and the admixed population jointly in a unified framework. Based on an assumption that the common hypothetical founder haplotypes give rise to both the ancestral and admixed population haplotypes, we employ an infinite hidden Markov model to characterize each ancestral population and further extend it to generate the admixed population. Through an effective utilization of the population structural information under a principled nonparametric Bayesian framework, the resulting model is significantly less sensitive to the choice and the amount of training data for ancestral populations than state-of-the-art algorithms. We also improve the robustness under deviation from common modeling assumptions by incorporating population-specific scale parameters that allow variable recombination rates in different populations. Our method is applicable to an admixed population from an arbitrary number of ancestral populations and also performs competitively in terms of spurious ancestry proportions under a general multi-way admixture assumption. We validate the proposed method by simulation under various admixing scenarios and present empirical analysis results on a worldwide distributed dataset from the Human Genome Diversity Project.

**Comment:** doi: 10.1534/genetics.112.140228

#### Flexible Martingale Priors for Deep Hierarchies

Jacob Steinhardt, Zoubin Ghahramani, 2012. (In 15th International Conference on Artificial Intelligence and Statistics).

Abstract▼ URL

When building priors over trees for Bayesian hierarchical models, there is a tension between maintaining desirable theoretical properties such as infinite exchangeability and important practical properties such as the ability to increase the depth of the tree to accommodate new data. We resolve this tension by presenting a family of infinitely exchangeable priors over discrete tree structures that allows the depth of the tree to grow with the data, and then showing that our family contains all hierarchical models with certain mild symmetry properties. We also show that deep hierarchical models are in general intimately tied to a process called a martingale, and use Doob’s martingale convergence theorem to demonstrate some unexpected properties of deep hierarchies.

#### Model based learning of sigma points in unscented Kalman filtering

Ryan D. Turner, Carl Edward Rasmussen, 2012. (Neurocomputing). **DOI**: 10.1016/j.neucom.2011.07.029.

Abstract▼ URL

The unscented Kalman filter (UKF) is a widely used method in control and time series applications. The UKF suffers from arbitrary parameters necessary for sigma point placement, potentially causing it to perform poorly in nonlinear problems. We show how to treat sigma point placement in a UKF as a learning problem in a model based view. We demonstrate that learning to place the sigma points correctly from data can make sigma point collapse much less likely. Learning can result in a significant increase in predictive performance over default settings of the parameters in the UKF and other filters designed to avoid the problems of the UKF, such as the GP-ADF. At the same time, we maintain a lower computational complexity than the other methods. We call our method UKF-L.

#### Decomposing signals into a sum of amplitude and frequency modulated sinusoids using probabilistic inference

Richard E. Turner, Maneesh Sahani, March 2012. (In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on). **DOI**: 10.1109/ICASSP.2012.6288343. **ISSN**: 1520-6149.

Abstract▼ URL

There are many methods for decomposing signals into a sum of amplitude and frequency modulated sinusoids. In this paper we take a new estimation based approach. Identifying the problem as ill-posed, we show how to regularize the solution by imposing soft constraints on the amplitude and phase variables of the sinusoids. Estimation proceeds using a version of Kalman smoothing. We evaluate the method on synthetic and natural, clean and noisy signals, showing that it outperforms previous decompositions, but at a higher computational cost.

#### Modelling Input Varying Correlations between Multiple Responses

Andrew Gordon Wilson, Zoubin Ghahramani, 2012. (In ECML/PKDD). Edited by Peter A. Flach, Tijl De Bie, Nello Cristianini. Springer. Lecture Notes in Computer Science. **ISBN**: 978-3-642-33485-6.

Abstract▼ URL

We introduced a generalised Wishart process (GWP) for modelling input dependent covariance matrices Σ(x), allowing one to model input varying correlations and uncertainties between multiple response variables. The GWP can naturally scale to thousands of response variables, as opposed to competing multivariate volatility models which are typically intractable for greater than 5 response variables. The GWP can also naturally capture a rich class of covariance dynamics – periodicity, Brownian motion, smoothness, …– through a covariance kernel.

#### Gaussian Process Regression Networks

Andrew Gordon Wilson, David A. Knowles, Zoubin Ghahramani, June 2012. (In 29th International Conference on Machine Learning). Edinburgh, Scotland.

Abstract▼ URL

We introduce a new regression framework, Gaussian process regression networks (GPRN), which combines the structural properties of Bayesian neural networks with the nonparametric flexibility of Gaussian processes. GPRN accommodates input (predictor) dependent signal and noise correlations between multiple output (response) variables, input dependent length-scales and amplitudes, and heavy-tailed predictive distributions. We derive both elliptical slice sampling and variational Bayes inference procedures for GPRN. We apply GPRN as a multiple output regression and multivariate volatility model, demonstrating substantially improved performance over eight popular multiple output (multi-task) Gaussian process models and three multivariate volatility models on real datasets, including a 1000 dimensional gene expression dataset.

#### Continuous Relaxations for Discrete Hamiltonian Monte Carlo

Yichuan Zhang, Charles A. Sutton, Amos J. Storkey, Zoubin Ghahramani, 2012. (In NIPS). Edited by Peter L. Bartlett, Fernando C. N. Pereira, Christopher J. C. Burges, Léon Bottou, Kilian Q. Weinberger.

Abstract▼ URL

Continuous relaxations play an important role in discrete optimization, but have not seen much use in approximate probabilistic inference. Here we show that a general form of the Gaussian Integral Trick makes it possible to transform a wide class of discrete variable undirected models into fully continuous systems. The continuous representation allows the use of gradient-based Hamiltonian Monte Carlo for inference, results in new ways of estimating normalization constants (partition functions), and in general opens up a number of new avenues for inference in difficult discrete systems. We demonstrate some of these continuous relaxation inference algorithms on a number of illustrative problems.

## 2011

#### Testing a Bayesian Measure of Representativeness Using a Large Image Database

Joshua Abbott, Katherine A. Heller, Zoubin Ghahramani, Thomas L. Griffiths, 2011. (In Advances in Neural Information Processing Systems 24). Cambridge, MA, USA. The MIT Press.

Abstract▼

How do people determine which elements of a set are most representative of that set? We extend an existing Bayesian measure of representativeness, which indicates the representativeness of a sample from a distribution, to define a measure of the representativeness of an item to a set. We show that this measure is formally related to a machine learning method known as Bayesian Sets. Building on this connection, we derive an analytic expression for the representativeness of objects described by a sparse vector of binary features. We then apply this measure to a large database of images, using it to determine which images are the most representative members of different sets. Comparing the resulting predictions to human judgments of representativeness provides a test of this measure with naturalistic stimuli, and illustrates how databases that are more commonly used in computer vision and machine learning can be used to evaluate psychological theories.

#### Path Integral Control and Bounded Rationality

Daniel A. Braun, Pedro A. Ortega, Evangelos Theodorou, Stefan Schaal, 2011. (In 2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning).

Abstract▼ URL

Path integral methods have recently been shown to be applicable to a very general class of optimal control problems. Here we examine the path integral formalism from a decision-theoretic point of view, since an optimal controller can always be regarded as an instance of a perfectly rational decision-maker that chooses its actions so as to maximize its expected utility. The problem with perfect rationality is, however, that finding optimal actions is often very difficult due to prohibitive computational resource costs that are not taken into account. In contrast, a bounded rational decision-maker has only limited resources and therefore needs to strike some compromise between the desired utility and the required resource costs. In particular, we suggest an information-theoretic measure of resource costs that can be derived axiomatically. As a consequence we obtain a variational principle for choice probabilities that trades off maximizing a given utility criterion and avoiding resource costs that arise due to deviating from initially given default choice probabilities. The resulting bounded rational policies are in general probabilistic. We show that the solutions found by the path integral formalism are such bounded rational policies. Furthermore, we show that the same formalism generalizes to discrete control problems, leading to linearly solvable bounded rational control policies in the case of Markov systems. Importantly, Bellman’s optimality principle is not presupposed by this variational principle, but it can be derived as a limit case. This suggests that the information theoretic formalization of bounded rationality might serve as a general principle in control design that unifies a number of recently reported approximate optimal control methods both in the continuous and discrete domain.

#### Motor coordination: When two have to act as one

Daniel A. Braun, Pedro A. Ortega, Daniel M. Wolpert, 2011. (Special issue of Experimental Brain Research on Joint Action).

Abstract▼ URL

Trying to pass someone walking toward you in a narrow corridor is a familiar example of a two-person motor game that requires coordination. In this study, we investigate coordination in sensorimotor tasks that correspond to classic coordination games with multiple Nash equilibria, such as “choosing sides”, “stag hunt”, “chicken”, and “battle of sexes”. In these tasks, subjects made reaching movements reflecting their continuously evolving “decisions” while they received a continuous payoff in the form of a resistive force counteracting their movements. Successful coordination required two subjects to “choose” the same Nash equilibrium in this force-payoff landscape within a single reach. We found that on the majority of trials coordination was achieved. Compared to the proportion of trials in which miscoordination occurred, successful coordination was characterized by several distinct features: an increased mutual information between the players’ movement endpoints, an increased joint entropy during the movements, and by differences in the timing of the players’ responses. Moreover, we found that the probability of successful coordination depends on the players’ initial distance from the Nash equilibria. Our results suggest that two-person coordination arises naturally in motor interactions and is facilitated by favorable initial positions, stereotypical motor pattern, and differences in response times.

#### A closed-loop human simulator for investigating the role of feedback-control in brain-machine interfaces

J. P. Cunningham, P. Nuyujukian, V. Gilja, C. A. Chestek, S. I. Ryu, K. V. Shenoy, 2011. (Journal of Neurophysiology).

Abstract▼ URL

Neural prosthetic systems seek to improve the lives of severely disabled people by decoding neural activity into useful behavioral commands. These systems and their decoding algorithms are typically developed “offline”, using neural activity previously gathered from a healthy animal, and the decoded movement is then compared with the true movement that accompanied the recorded neural activity. However, this offline design and testing may neglect important features of a real prosthesis, most notably the critical role of feedback control, which enables the user to adjust neural activity while using the prosthesis. We hypothesize that understanding and optimally designing high-performance decoders require an experimental platform where humans are in closed-loop with the various candidate decode systems and algorithms. The extent to which the subject can, for a particular decode system, algorithm, or parameter, engage feedback and other strategies to improve decode performance remains unexplored. Closed-loop testing may suggest different choices than offline analyses. Here we ask if a healthy human subject, using a closed-loop neural prosthesis driven by synthetic neural activity, can inform system design. We use this online prosthesis simulator (OPS) to optimize “online” decode performance based on a key parameter of a current state-of-the-art decode algorithm, the bin width of a Kalman filter. First, we show that offline and online analyses indeed suggest different parameter choices. Previous literature and our offline analyses agree that neural activity should be analyzed in bins of 100- to 300-ms width. OPS analysis, which incorporates feedback control, suggests that much shorter bin widths (25-50 ms) yield higher decode performance. Second, we confirm this surprising finding using a closed-loop rhesus monkey prosthetic system. These findings illustrate the type of discovery made possible by the OPS, and so we hypothesize that this novel testing approach will help in the design of prosthetic systems that will translate well to human patients.

#### Language-independent Bayesian sentiment mining of Twitter

A. Davies, Z. Ghahramani, August 2011. (In The Fifth Workshop on Social Network Mining and Analysis (SNA-KDD 2011)).

Abstract▼ URL

This paper outlines a new language-independent model for sentiment analysis of short, social-network statuses. We demonstrate this on data from Twitter, modelling happy vs sad sentiment, and show that in some circumstances this outperforms similar Naive Bayes models by more than 10%. We also propose an extension to allow the modelling of different sentiment distributions in different geographic regions, while incorporating information from neighbouring regions. We outline the considerations when creating a system analysing Twitter data and present a scalable system of data acquisition and prediction that can monitor the sentiment of tweets in real time.

#### PILCO: A Model-Based and Data-Efficient Approach to Policy Search

Marc Peter Deisenroth, Carl Edward Rasmussen, 2011. (In 28th International Conference on Machine Learning).

Abstract▼ URL

In this paper, we introduce PILCO, a practical, data-efficient model-based policy search method. PILCO reduces model bias, one of the key problems of model-based reinforcement learning, in a principled way. By learning a probabilistic dynamics model and explicitly incorporating model uncertainty into long-term planning, PILCO can cope with very little data and facilitates learning from scratch in only a few trials. Policy evaluation is performed in closed form using state-of-the-art approximate inference. Furthermore, policy gradients are computed analytically for policy improvement. We report unprecedented learning efficiency on challenging and high-dimensional control tasks.

**Comment:** web site

#### Learning to Control a Low-Cost Manipulator using Data-Efficient Reinforcement Learning

Marc Peter Deisenroth, Carl Edward Rasmussen, Dieter Fox, June 2011. (In 9th International Conference on Robotics: Science & Systems). Los Angeles, CA, USA.

Abstract▼ URL

Over the last years, there has been substantial progress in robust manipulation in unstructured environments. The long-term goal of our work is to get away from precise, but very expensive robotic systems and to develop affordable, potentially imprecise, self-adaptive manipulator systems that can interactively perform tasks such as playing with children. In this paper, we demonstrate how a low-cost off-the-shelf robotic system can learn closed-loop policies for a stacking task in only a handful of trials - from scratch. Our manipulator is inaccurate and provides no pose feedback. For learning a controller in the work space of a Kinect-style depth camera, we use a model-based reinforcement learning technique. Our learning method is data efficient, reduces model bias, and deals with several noise sources in a principled way during long-term planning. We present a way of incorporating state-space constraints into the learning process and analyze the learning gain by exploiting the sequential structure of the stacking task.

**Comment:** project site

#### A Comparison of Human and Agent Reinforcement Learning in Partially Observable Domains

Finale Doshi-Velez, Zoubin Ghahramani, 2011. (In 33rd Annual Meeting of the Cognitive Science Society). Boston, MA.

Abstract▼ URL

It is commonly stated that reinforcement learning (RL) algorithms learn slower than humans. In this work, we investigate this claim using two standard problems from the RL literature. We compare the performance of human subjects to RL techniques. We find that context—the meaningfulness of the observations—plays a significant role in the rate of human RL. Moreover, without contextual information, humans often fare much worse than classic algorithms. Comparing the detailed responses of humans and RL algorithms, we also find that humans appear to employ rather different strategies from standard algorithms, even in cases where they had indistinguishable performance to them. Our research both sheds light on human RL and provides insights for improving RL algorithms.

#### Additive Gaussian Processes

David Duvenaud, Hannes Nickisch, Carl Edward Rasmussen, 2011. (In Advances in Neural Information Processing Systems 24). Granada, Spain.

Abstract▼ URL

We introduce a Gaussian process model of functions which are additive. An additive function is one which decomposes into a sum of low-dimensional functions, each depending on only a subset of the input variables. Additive GPs generalize both Generalized Additive Models, and the standard GP models which use squared-exponential kernels. Hyperparameter learning in this model can be seen as Bayesian Hierarchical Kernel Learning (HKL). We introduce an expressive but tractable parameterization of the kernel function, which allows efficient evaluation of all input interaction terms, whose number is exponential in the input dimension. The additional structure discoverable by this model results in increased interpretability, as well as state-of-the-art predictive power in regression tasks.
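As a rough illustration of why all interaction orders can be evaluated cheaply, the sketch below computes an additive kernel as a weighted sum of elementary symmetric polynomials of one-dimensional base kernels. The function names, the shared length-scale, and the dynamic-programming recurrence are assumptions made for illustration and are not taken from the paper's implementation. A brute-force version that enumerates every subset of inputs is included only as a cross-check.

```python
import numpy as np
from itertools import combinations


def se_kernel_1d(x, y, lengthscale=1.0):
    """Squared-exponential base kernel applied to each input dimension separately."""
    return np.exp(-0.5 * ((x - y) / lengthscale) ** 2)


def additive_kernel(x, y, order_variances, lengthscale=1.0):
    """Weighted sum over interaction orders d = 1..len(order_variances).

    The order-d term is the d-th elementary symmetric polynomial of the
    per-dimension base kernel values, built up in O(D^2) operations.
    """
    z = se_kernel_1d(np.asarray(x, float), np.asarray(y, float), lengthscale)
    D = z.shape[0]
    e = np.zeros(D + 1)
    e[0] = 1.0
    for zi in z:
        # Update high orders first so each dimension is used at most once per term.
        for d in range(D, 0, -1):
            e[d] += zi * e[d - 1]
    return float(sum(v * e[d] for d, v in enumerate(order_variances, start=1)))


def additive_kernel_bruteforce(x, y, order_variances, lengthscale=1.0):
    """Same quantity by explicit enumeration of subsets (exponential cost)."""
    z = se_kernel_1d(np.asarray(x, float), np.asarray(y, float), lengthscale)
    total = 0.0
    for d, v in enumerate(order_variances, start=1):
        total += v * sum(np.prod(z[list(idx)]) for idx in combinations(range(len(z)), d))
    return float(total)
```

Setting all but the first order variance to zero gives a first-order additive model, while keeping only the highest order gives a product of one-dimensional kernels, which for squared-exponential base kernels is the usual multivariate squared-exponential kernel.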

#### The Indian buffet process: An introduction and review

Thomas L. Griffiths, Zoubin Ghahramani, April 2011. (Journal of Machine Learning Research).

Abstract▼ URL

The Indian buffet process is a stochastic process defining a probability distribution over equivalence classes of sparse binary matrices with a finite number of rows and an unbounded number of columns. This distribution is suitable for use as a prior in probabilistic models that represent objects using a potentially infinite array of features, or that involve bipartite graphs in which the size of at least one class of nodes is unknown. We give a detailed derivation of this distribution, and illustrate its use as a prior in an infinite latent feature model. We then review recent applications of the Indian buffet process in machine learning, discuss its extensions, and summarize its connections to other stochastic processes.
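The standard sequential ("customers and dishes") description of the Indian buffet process can be sketched in a few lines. This is a minimal sampler for illustration only; the function name, the seed handling, and the column ordering are assumptions, and the result is one representative matrix rather than an equivalence class.

```python
import numpy as np


def sample_ibp(num_objects, alpha, seed=0):
    """Draw a binary feature matrix from the sequential IBP construction.

    Object i takes each existing feature k with probability m_k / i, where m_k
    counts earlier objects that have feature k, then adds Poisson(alpha / i)
    new features of its own.
    """
    rng = np.random.default_rng(seed)
    columns = []  # one list of 0/1 entries per feature
    for i in range(1, num_objects + 1):
        for col in columns:
            col.append(int(rng.random() < sum(col) / i))
        for _ in range(rng.poisson(alpha / i)):
            columns.append([0] * (i - 1) + [1])
    if not columns:
        return np.zeros((num_objects, 0), dtype=int)
    return np.array(columns, dtype=int).T  # rows are objects, columns are features
```

The number of rows is fixed in advance, while the number of columns is random and grows with `alpha` and with the number of objects.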

#### Reinforcement Learning with Reference Tracking Control in Continuous State Spaces

Joseph Hall, Carl Edward Rasmussen, Jan Maciejowski, 2011. (In Proceedings of 50th IEEE Conference on Decision and Control and European Control Conference).

Abstract▼ URL

The contribution described in this paper is an algorithm for learning nonlinear, reference tracking, control policies given no prior knowledge of the dynamical system and limited interaction with the system through the learning process. Concepts from the field of reinforcement learning, Bayesian statistics and classical control have been brought together in the formulation of this algorithm which can be viewed as a form of indirect self-tuning regulator. On the task of reference tracking using the inverted pendulum it was shown to yield generally improved performance over the best controller derived from the standard linear quadratic method using only 30 s of total interaction with the system. Finally, the algorithm was shown to work on the double pendulum, proving its ability to solve nontrivial control tasks.

#### Robust Multi-Class Gaussian Process Classification

Daniel Hernández-Lobato, José Miguel Hernández-Lobato, Pierre Dupont, 2011. (In Advances in Neural Information Processing Systems 24).

Abstract▼ URL

Multi-class Gaussian Process Classifiers (MGPCs) are often affected by overfitting problems when labeling errors occur far from the decision boundaries. To prevent this, we investigate a robust MGPC (RMGPC) which considers labeling errors independently of their distance to the decision boundaries. Expectation propagation is used for approximate inference. Experiments with several datasets in which noise is injected in the labels illustrate the benefits of RMGPC. This method performs better than other Gaussian process alternatives based on considering latent Gaussian noise or heavy-tailed processes. When no noise is injected in the labels, RMGPC still performs as well as or better than the other methods. Finally, we show how RMGPC can be used for successfully identifying data instances which are difficult to classify correctly in practice.

#### Bayesian Active Learning for Classification and Preference Learning

Neil Houlsby, Ferenc Huszar, Zoubin Ghahramani, Máté Lengyel, 2011. (arXiv).

Abstract▼ URL

Information theoretic active learning has been widely studied for probabilistic models. For simple regression an optimal myopic policy is easily tractable. However, for other tasks and with more complex models, such as classification with nonparametric models, the optimal solution is harder to compute. Current approaches make approximations to achieve tractability. We propose an approach that expresses information gain in terms of predictive entropies, and apply this method to the Gaussian Process Classifier (GPC). Our approach makes minimal approximations to the full information theoretic objective. Our experimental performance compares favourably to many popular active learning algorithms, and has equal or lower computational complexity. We compare well to decision theoretic approaches also, which are privy to more information and require much more computational time. Secondly, by developing further a reformulation of binary preference learning to a classification problem, we extend our algorithm to Gaussian Process preference learning.
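The idea of writing information gain in terms of predictive entropies can be illustrated for binary classification: the score is the entropy of the averaged prediction minus the average entropy of the individual predictions. The sketch below estimates this from Monte Carlo samples of the predictive probability. The sampling-based estimate and the function names are assumptions for illustration; the paper derives its own approximations for the Gaussian process classifier.

```python
import numpy as np


def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli variable, clipped for numerical safety."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))


def information_gain(prob_samples):
    """Entropy of the mean prediction minus the mean entropy of the predictions.

    prob_samples has shape (S, N): S posterior samples of P(y = 1 | x) for each
    of N candidate inputs. Returns one score per candidate.
    """
    prob_samples = np.asarray(prob_samples, dtype=float)
    entropy_of_mean = binary_entropy(prob_samples.mean(axis=0))
    mean_of_entropy = binary_entropy(prob_samples).mean(axis=0)
    return entropy_of_mean - mean_of_entropy
```

A candidate on which every sample predicts 0.5 scores zero, because the uncertainty is not reducible by learning more about the model. A candidate on which samples confidently disagree scores close to log 2.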

#### A Kernel Approach to Tractable Bayesian Nonparametrics

Ferenc Huszár, Simon Lacoste-Julien, 2011. University of Cambridge,

Abstract▼ URL

Inference in popular nonparametric Bayesian models typically relies on sampling or other approximations. This paper presents a general methodology for constructing novel tractable nonparametric Bayesian methods by applying the kernel trick to inference in a parametric Bayesian model. For example, Gaussian process regression can be derived this way from Bayesian linear regression. Despite the success of the Gaussian process framework, the kernel trick is rarely explicitly considered in the Bayesian literature. In this paper, we aim to fill this gap and demonstrate the potential of applying the kernel trick to tractable Bayesian parametric models in a wider context than just regression. As an example, we present an intuitive Bayesian kernel machine for density estimation that is obtained by applying the kernel trick to a Gaussian generative model in feature space.

**Comment:** arXiv:1103.1761
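The regression example mentioned in the abstract can be checked numerically: with a unit Gaussian prior on the weights, the Bayesian linear regression predictive mean equals the predictive mean obtained from inner products of features alone. The sketch below assumes an explicit finite feature matrix and hypothetical function names; it illustrates the identity only, not the density-estimation method proposed in the paper.

```python
import numpy as np


def blr_predictive_mean(Phi, y, Phi_star, noise_var):
    """Weight-space view: posterior mean of the weights, then predict."""
    num_features = Phi.shape[1]
    A = Phi.T @ Phi + noise_var * np.eye(num_features)
    w_mean = np.linalg.solve(A, Phi.T @ y)
    return Phi_star @ w_mean


def kernel_predictive_mean(Phi, y, Phi_star, noise_var):
    """Function-space view: only inner products k(x, x') = phi(x)^T phi(x') appear."""
    K = Phi @ Phi.T
    K_star = Phi_star @ Phi.T
    alpha = np.linalg.solve(K + noise_var * np.eye(K.shape[0]), y)
    return K_star @ alpha
```

Because the second function touches the features only through `Phi @ Phi.T`, those products can be replaced by any positive-definite kernel, which is what turns the parametric model into Gaussian process regression.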

#### Message Passing Algorithms for the Dirichlet Diffusion Tree

David A. Knowles, Jurgen Van Gael, Zoubin Ghahramani, 2011. (In 28th International Conference on Machine Learning).

Abstract▼ URL

We demonstrate efficient approximate inference for the Dirichlet Diffusion Tree (Neal, 2003), a Bayesian nonparametric prior over tree structures. Although DDTs provide a powerful and elegant approach for modeling hierarchies, they haven’t seen much use to date. One problem is the computational cost of MCMC inference. We provide the first deterministic approximate inference methods for DDT models and show excellent performance compared to the MCMC alternative. We present message passing algorithms to approximate the Bayesian model evidence for a specific tree. This is used to drive sequential tree building and greedy search to find optimal tree structures, corresponding to hierarchical clusterings of the data. We demonstrate appropriate observation models for continuous and binary data. The empirical performance of our method is very close to the computationally expensive MCMC alternative on a density estimation problem, and significantly outperforms kernel density estimators.

**Comment:** web site

#### Pitman-Yor Diffusion Trees

David A. Knowles, Zoubin Ghahramani, 2011. (In 27th Conference on Uncertainty in Artificial Intelligence).

Abstract▼ URL

We introduce the Pitman-Yor Diffusion Tree (PYDT) for hierarchical clustering, a generalization of the Dirichlet Diffusion Tree (Neal, 2001) which removes the restriction to binary branching structure. The generative process is described and shown to result in an exchangeable distribution over data points. We prove some theoretical properties of the model and then present two inference methods: a collapsed MCMC sampler which allows us to model uncertainty over tree structures, and a computationally efficient greedy Bayesian EM search algorithm. Both algorithms use message passing on the tree structure. The utility of the model and algorithms is demonstrated on synthetic and real world data, both continuous and binary.

**Comment:** web site

#### Nonparametric Bayesian Sparse Factor Models with application to Gene Expression modelling.

David A. Knowles, Zoubin Ghahramani, 2011. (Annals of Applied Statistics).

Abstract▼ URL

A nonparametric Bayesian extension of Factor Analysis (FA) is proposed where observed data Y is modeled as a linear superposition, G, of a potentially infinite number of hidden factors, X. The Indian Buffet Process (IBP) is used as a prior on G to incorporate sparsity and to allow the number of latent features to be inferred. The model’s utility for modeling gene expression data is investigated using randomly generated data sets based on a known sparse connectivity matrix for E. coli, and on three biological data sets of increasing complexity.

#### Non-conjugate Variational Message Passing for Multinomial and Binary Regression

David A. Knowles, Thomas P. Minka, 2011. (In Advances in Neural Information Processing Systems 24).

Abstract▼ URL

Variational Message Passing (VMP) is an algorithmic implementation of the Variational Bayes (VB) method which applies only in the special case of conjugate exponential family models. We propose an extension to VMP, which we refer to as Non-conjugate Variational Message Passing (NCVMP), which aims to alleviate this restriction while maintaining modularity, allowing choice in how expectations are calculated, and integrating into an existing message-passing framework: Infer.NET. We demonstrate NCVMP on logistic binary and multinomial regression. In the multinomial case we introduce a novel variational bound for the softmax factor which is tighter than other commonly used bounds whilst maintaining computational tractability.

**Comment:** web site supplementary

#### Approximate Inference for the Loss-Calibrated Bayesian

Simon Lacoste-Julien, Ferenc Huszár, Zoubin Ghahramani, April 2011. (In 14th International Conference on Artificial Intelligence and Statistics). Edited by Geoff Gordon, David Dunson. Fort Lauderdale, FL, USA. Journal of Machine Learning Research.

Abstract▼ URL

We consider the problem of approximate inference in the context of Bayesian decision theory. Traditional approaches focus on approximating general properties of the posterior, ignoring the decision task – and associated losses – for which the posterior could be used. We argue that this can be suboptimal and propose instead to *loss-calibrate* the approximate inference methods with respect to the decision task at hand. We present a general framework rooted in Bayesian decision theory to analyze approximate inference from the perspective of losses, opening up several research directions. As a first loss-calibrated approximate inference attempt, we propose an EM-like algorithm on the Bayesian posterior risk and show how it can improve a standard approach to Gaussian process classification when losses are asymmetric.

#### Empirical models of spiking in neural populations

J. H. Macke, L. Busing, J. P. Cunningham, B. M. Yu, K. V. Shenoy, M. Sahani, December 2011. (In Advances in Neural Information Processing Systems 24). Granada, Spain.

Abstract▼

Neurons in the neocortex code and compute as part of a locally interconnected population. Large-scale multi-electrode recording makes it possible to access these population processes empirically by fitting statistical models to unaveraged data. What statistical structure best describes the concurrent spiking of cells within a local network? We argue that in the cortex, where firing exhibits extensive correlations in both time and space and where a typical sample of neurons still reflects only a very small fraction of the local population, the most appropriate model captures shared variability by a low-dimensional latent process evolving with smooth dynamics, rather than by putative direct coupling. We test this claim by comparing a latent dynamical model with realistic spiking observations to coupled generalised linear spike-response models (GLMs) using cortical recordings. We find that the latent dynamical approach outperforms the GLM in terms of goodness-of-fit, and reproduces the temporal correlations in the data more accurately. We also compare models whose observation models are derived from either Gaussian or point-process models, finding that the non-Gaussian model provides slightly better goodness-of-fit and more realistic population spike counts.

#### Gaussian Process Training with Input Noise

Andrew McHutchon, Carl Edward Rasmussen, 2011. (In Advances in Neural Information Processing Systems 24). Edited by J. Shawe-Taylor, R.S. Zemel, P.L. Bartlett, F. Pereira, K.Q. Weinberger. Granada, Spain. Curran Associates, Inc.

Abstract▼ URL

In standard Gaussian Process regression input locations are assumed to be noise free. We present a simple yet effective GP model for training on input points corrupted by i.i.d. Gaussian noise. To make computations tractable we use a local linear expansion about each input point. This allows the input noise to be recast as output noise proportional to the squared gradient of the GP posterior mean. The input noise hyperparameters are trained alongside other hyperparameters by the usual method of maximisation of the marginal likelihood, and allow estimation of the noise levels on each input dimension. Training uses an iterative scheme, which alternates between optimising the hyperparameters and calculating the posterior gradient. Analytic predictive moments can then be found for Gaussian distributed test points. We compare our model to others over a range of different regression problems and show that it improves over current methods.
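The recasting of input noise as output noise can be sketched directly: a first-order expansion around each training input adds a term equal to the squared posterior-mean gradient weighted by the input noise variance in each dimension. The function below assumes the gradients have already been computed by some GP implementation; its name and argument layout are hypothetical and chosen for illustration.

```python
import numpy as np


def effective_noise_variance(mean_gradients, input_noise_var, output_noise_var):
    """Per-point output noise after a local linear expansion about each input.

    mean_gradients:   (N, D) gradient of the GP posterior mean at each training input
    input_noise_var:  (D,) independent Gaussian noise variance on each input dimension
    output_noise_var: scalar variance of the ordinary output noise
    """
    grads = np.asarray(mean_gradients, dtype=float)
    input_var = np.asarray(input_noise_var, dtype=float)
    return output_noise_var + (grads ** 2) @ input_var
```

Points where the function is steep receive a larger effective noise, and points where it is flat are left almost unchanged. In the iterative scheme described in the abstract, these variances would be recomputed each time the posterior gradient is updated.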

#### Generalised Bayesian Matrix Factorisation Models

Shakir Mohamed, 2011. University of Cambridge, Department of Engineering, Cambridge, UK.

Abstract▼ URL

Factor analysis and related models for probabilistic matrix factorisation are of central importance to the unsupervised analysis of data, with a colourful history more than a century long. Probabilistic models for matrix factorisation allow us to explore the underlying structure in data, and have relevance in a vast number of application areas including collaborative filtering, source separation, missing data imputation, gene expression analysis, information retrieval, computational finance and computer vision, amongst others. This thesis develops generalisations of matrix factorisation models that advance our understanding and enhance the applicability of this important class of models. The generalisation of models for matrix factorisation focuses on three concerns: widening the applicability of latent variable models to the diverse types of data that are currently available; considering alternative structural forms in the underlying representations that are inferred; and including higher order data structures into the matrix factorisation framework. These three issues reflect the reality of modern data analysis and we develop new models that allow for a principled exploration and use of data in these settings. We place emphasis on Bayesian approaches to learning and the advantages that come with the Bayesian methodology. Our port of departure is a generalisation of latent variable models to members of the exponential family of distributions. This generalisation allows for the analysis of data that may be real-valued, binary, counts, non-negative or a heterogeneous set of these data types. The model unifies various existing models and constructs for unsupervised settings, the complementary framework to the generalised linear models in regression. Moving to structural considerations, we develop Bayesian methods for learning sparse latent representations. 
We define ideas of weakly and strongly sparse vectors and investigate the classes of prior distributions that give rise to these forms of sparsity, namely the scale-mixture of Gaussians and the spike-and-slab distribution. Based on these sparsity favouring priors, we develop and compare methods for sparse matrix factorisation and present the first comparison of these sparse learning approaches. As a second structural consideration, we develop models with the ability to generate correlated binary vectors. Moment-matching is used to allow binary data with specified correlation to be generated, based on dichotomisation of the Gaussian distribution. We then develop a novel and simple method for binary PCA based on Gaussian dichotomisation. The third generalisation considers the extension of matrix factorisation models to multi-dimensional arrays of data that are increasingly prevalent. We develop the first Bayesian model for non-negative tensor factorisation and explore the relationship between this model and the previously described models for matrix factorisation.

#### Distinct epigenomic features in human cardiomyopathy

Mehregan Movassagh, Mun-Kit Choy, David A. Knowles, Lina Cordeddu, Syed Haider, Thomas Down, Lee Siggens, Ana Vujic, Ilenia Simeoni, Chris Penkett, Martin Goddard, Pietro Lio, Martin Bennett, Roger Foo, 2011. (Circulation, American Heart Association).

Abstract▼ URL

Background. The epigenome refers to marks on the genome including DNA methylation and histone modifications that regulate the expression of underlying genes. A consistent profile of gene expression changes in end-stage cardiomyopathy led us to hypothesise that distinct global patterns of the epigenome may also exist. Methods and Results. We constructed genome-wide maps of DNA methylation and Histone-3 Lysine-36 tri-methylation (H3K36me3)-enrichment for cardiomyopathic and normal human hearts. 506 Mb of sequence per library was generated by high-throughput sequencing, covering 24 million out of the 28 million CG di-nucleotides in the human genome. DNA methylation was significantly different in promoter CpG-islands (CGI), intra-genic CGI, gene bodies and H3K36me3-enriched regions of the genome. Moreover, DNA methylation differences were present in promoters of upregulated genes but not down-regulated genes. The profile of H3K36me3-enrichment itself was also significantly different in protein-coding regions of the genome. Conclusions. Distinct epigenomic patterns exist in important DNA elements of the human cardiac genome in end-stage cardiomyopathy. If epigenomic patterns track with disease progression, assays for the epigenome may be more useful than quantification of mRNA for assessing prognosis in heart failure. These results open up an important new horizon of research and further studies will be needed to determine how epigenomics contribute to altered gene expression in cardiomyopathy.

#### Projective limit random probabilities on Polish spaces

Peter Orbanz, 2011. (Electron. J. Stat.).

Abstract▼ URL

A pivotal problem in Bayesian nonparametrics is the construction of prior distributions on the space M(V) of probability measures on a given domain V. In principle, such distributions on the infinite-dimensional space M(V) can be constructed from their finite-dimensional marginals—the most prominent example being the construction of the Dirichlet process from finite-dimensional Dirichlet distributions. This approach is both intuitive and applicable to the construction of arbitrary distributions on M(V), but also hamstrung by a number of technical difficulties. We show how these difficulties can be resolved if the domain V is a Polish topological space, and give a representation theorem directly applicable to the construction of any probability distribution on M(V) whose first moment measure is well-defined. The proof draws on a projective limit theorem of Bochner, and on properties of set functions on Polish spaces to establish countable additivity of the resulting random probabilities.

#### A Unified Framework for Resource-Bounded Agents Interacting with an Unknown Environment

Pedro A. Ortega, 2011. Department of Engineering, University of Cambridge,

Abstract▼ URL

The aim of this thesis is to present a mathematical framework for conceptualizing and constructing adaptive autonomous systems under resource constraints. The first part of this thesis contains a concise presentation of the foundations of classical agency: namely the formalizations of decision making and learning. Decision making includes: (a) subjective expected utility (SEU) theory, the framework of decision making under uncertainty; (b) the maximum SEU principle to choose the optimal solution; and (c) its application to the design of autonomous systems, culminating in the Bellman optimality equations. Learning includes: (a) Bayesian probability theory, the theory for reasoning under uncertainty that extends logic; and (b) Bayes-Optimal agents, the application of Bayesian probability theory to the design of optimal adaptive agents. Then, two major problems of the maximum SEU principle are highlighted: (a) the prohibitive computational costs and (b) the need for the causal precedence of the choice of the policy. The second part of this thesis tackles the two aforementioned problems. First, an information-theoretic notion of resources in autonomous systems is established. Second, a framework for resource-bounded agency is introduced. This includes: (a) a maximum bounded SEU principle that is derived from a set of axioms of utility; (b) an axiomatic model of probabilistic causality, which is applied for the formalization of autonomous systems having uncertainty over their policy and environment; and (c) the Bayesian control rule, which is derived from the maximum bounded SEU principle and the model of causality, implementing a stochastic adaptive control law that deals with the case where autonomous agents are uncertain about their policy and environment.

#### Information, Utility and Bounded Rationality

Pedro A. Ortega, Daniel A. Braun, 2011. (In The fourth conference on artificial general intelligence). Springer-Verlag. Lecture Notes on Artificial Intelligence.

Abstract▼ URL

Perfectly rational decision-makers maximize expected utility, but crucially ignore the resource costs incurred when determining optimal actions. Here we employ an axiomatic framework for bounded rational decision-making based on a thermodynamic interpretation of resource costs as information costs. This leads to a variational free utility principle akin to thermodynamical free energy that trades off utility and information costs. We show that bounded optimal control solutions can be derived from this variational principle, which leads in general to stochastic policies. Furthermore, we show that risk-sensitive and robust (minimax) control schemes fall out naturally from this framework if the environment is considered as a bounded rational and perfectly rational opponent, respectively. When resource costs are ignored, the maximum expected utility principle is recovered.

#### Reinforcement Learning and the Bayesian Control Rule

Pedro A. Ortega, Daniel A. Braun, Simon Godsill, 2011. (In The fourth conference on artificial general intelligence). Springer-Verlag. Lecture Notes on Artificial Intelligence.

Abstract▼ URL

We present an actor-critic scheme for reinforcement learning in complex domains. The main contribution is to show that planning and I/O dynamics can be separated such that an intractable planning problem reduces to a simple multi-armed bandit problem, where each lever stands for a potentially arbitrarily complex policy. Furthermore, we use the Bayesian control rule to construct an adaptive bandit player that is universal with respect to a given class of optimal bandit players, thus indirectly constructing an adaptive agent that is universal with respect to a given class of policies.

#### Dynamical Segmentation of single trials from population neural data

B. Petreska, B. M. Yu, J. P. Cunningham, G. Santhanam, S. I. Ryu, K. V. Shenoy, M. Sahani, December 2011. (In Advances in Neural Information Processing Systems 24). Granada, Spain.

Abstract▼

Simultaneous recordings of many neurons embedded within a recurrently-connected cortical network may provide concurrent views into the dynamical processes of that network, and thus its computational function. In principle, these dynamics might be identified by purely unsupervised, statistical means. Here, we show that a Hidden Switching Linear Dynamical Systems (HSLDS) model - in which multiple linear dynamical laws approximate a nonlinear and potentially non-stationary dynamical process - is able to distinguish dynamical regimes within single-trial motor cortical activity associated with the preparation and initiation of hand movements. The regimes are identified without reference to behavioural or experimental epochs, but nonetheless transitions between them correlate strongly with external events whose timing may vary from trial to trial. The HSLDS model also performs better than recent comparable models in predicting the firing rate of an isolated neuron based on the firing rates of others, suggesting that it captures more of the “shared variance” of the data. Thus, the method is able to trace the dynamical processes underlying the coordinated evolution of network activity in a way that appears to reflect its computational role.

#### On the computability and complexity of Bayesian reasoning

Daniel M. Roy, 2011. (In NIPS Workshop on Philosophy and Machine Learning).

Abstract▼ URL

If we consider the claim made by some cognitive scientists that the mind performs Bayesian reasoning, and if we simultaneously accept the Physical Church-Turing thesis and thus believe that the computational power of the mind is no more than that of a Turing machine, then what limitations are there to the reasoning abilities of the mind? I give an overview of joint work with Nathanael Ackerman (Harvard, Mathematics) and Cameron Freer (MIT, CSAIL) that bears on the computability and complexity of Bayesian reasoning. In particular, we prove that conditional probability is in general not computable in the presence of continuous random variables. However, in light of additional structure in the prior distribution, such as the presence of certain types of noise, or of exchangeability, conditioning is possible. These results cover most of statistical practice. At the workshop on Logic and Computational Complexity, we presented results on the computational complexity of conditioning, embedding sharp-P-complete problems in the task of computing conditional probabilities for diffuse continuous random variables. This work complements older work. For example, under cryptographic assumptions, the computational complexity of producing samples and computing probabilities was separated by Ben-David, Chor, Goldreich and Luby. In recent work, we also make use of cryptographic assumptions to show that different representations of exchangeable sequences may have vastly different complexity. However, when faced with an adversary that is computationally bounded, these different representations have the same complexity, highlighting the fact that knowledge representation and approximation play a fundamental role in the possibility and plausibility of Bayesian reasoning.

#### Scalable Inference for Structured Gaussian Process Models

Yunus Saatçi, 2011. University of Cambridge, Department of Engineering, Cambridge, UK.

Abstract▼ URL

The generic inference and learning algorithm for Gaussian Process (GP) regression has O(N^3) runtime and O(N^2) memory complexity, where N is the number of observations in the dataset. Given the computational resources available to a present-day workstation, this implies that GP regression simply cannot be run on large datasets. The need to use non-Gaussian likelihood functions for tasks such as classification adds even more to the computational burden involved. The majority of algorithms designed to improve the scaling of GPs are founded on the idea of approximating the true covariance matrix, which is usually of rank N, with a matrix of rank P, where P << N. Typically, the true training set is replaced with a smaller, representative (pseudo-) training set such that a specific measure of information loss is minimized. These algorithms typically attain O(P^2 N) runtime and O(PN) space complexity. They are also general in the sense that they are designed to work with any covariance function. In essence, they trade off accuracy with computational complexity. The central contribution of this thesis is to improve scaling instead by exploiting any structure that is present in the covariance matrices generated by particular covariance functions. Instead of settling for a kernel-independent accuracy/complexity trade off, as is done in much of the literature, we often obtain accuracies close to, or exactly equal to the full GP model at a fraction of the computational cost. We define a structured GP as any GP model that is endowed with a kernel which produces structured covariance matrices. A trivial example of a structured GP is one with the linear regression kernel. In this case, given inputs living in R^D, the covariance matrices generated have rank D; this results in significant computational gains in the usual case where D << N. Another case arises when a stationary kernel is evaluated on equispaced, scalar inputs. This results in Toeplitz covariance matrices and all necessary computations can be carried out exactly in O(N log N). This thesis studies four more types of structured GP. First, we comprehensively review the case of kernels corresponding to Gauss-Markov processes evaluated on scalar inputs. Using state-space models we show how (generalised) regression (including hyperparameter learning) can be performed in O(N log N) runtime and O(N) space. Second, we study the case where we introduce block structure into the covariance matrix of a GP time-series model by assuming a particular form of nonstationarity a priori. Third, we extend the efficiency of scalar Gauss-Markov processes to higher-dimensional input spaces by assuming additivity. We illustrate the connections between the classical backfitting algorithm and approximate Bayesian inference techniques including Gibbs sampling and variational Bayes. We also show that it is possible to relax the rather strong assumption of additivity without sacrificing O(N log N) complexity, by means of a projection-pursuit style GP regression model. Finally, we study the properties of a GP model with a tensor product kernel evaluated on a multivariate grid of input locations. We show that for an arbitrary (regular or irregular) grid the resulting covariance matrices are Kronecker and full GP regression can be implemented in O(N) time and memory usage. We illustrate the power of these methods on several real-world regression datasets which satisfy the assumptions inherent in the structured GP employed. In many cases we obtain performance comparable to the generic GP algorithm. We also analyse the performance degradation when these assumptions are not met, and in several cases show that it is comparable to that observed for sparse GP methods. We provide similar results for regression tasks with non-Gaussian likelihoods, an extension rarely addressed by sparse GP techniques.
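
As a rough illustration of why Kronecker-structured covariance matrices are cheap to work with, the sketch below multiplies a Kronecker product by a vector without ever forming the full matrix. It is not taken from the thesis: the function name `kron_matvec`, the random factors `A` and `B`, and their sizes are all assumptions made for the example, and it relies only on the standard identity (A ⊗ B) vec(X) = vec(A X Bᵀ) for row-major vectorisation.

```python
import numpy as np

def kron_matvec(A, B, x):
    """Compute (A kron B) @ x without building the Kronecker product.

    A is (m, m), B is (n, n), and x has length m * n (row-major layout).
    """
    m, n = A.shape[0], B.shape[0]
    X = x.reshape(m, n)              # undo the row-major vectorisation
    return (A @ X @ B.T).ravel()     # vec(A X B^T) equals (A kron B) vec(X)

# Hypothetical small factors, chosen only so the dense check is cheap.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((3, 3))
x = rng.standard_normal(12)

dense = np.kron(A, B) @ x            # forms the full 12 x 12 matrix
fast = kron_matvec(A, B, x)          # works with the 4 x 4 and 3 x 3 factors only
```

The dense route stores and multiplies an (mn) × (mn) matrix, whereas the structured route only touches the two small factors, which is the kind of saving that grid-structured kernels make possible.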

#### Dichotomous cellular properties of mouse orexin/hypocretin neurons

Cornelia Schone, Anne Venner, David A. Knowles, Mahesh M Karnani, Denis Burdakov, 2011. (The Journal of Physiology).

Abstract▼ URL

Hypothalamic hypocretin/orexin (hcrt/orx) neurons recently emerged as critical regulators of sleep-wake cycles, reward-seeking, and body energy balance. However, at the level of cellular and network properties, it remains unclear whether hcrt/orx neurons are one homogenous population, or whether there are several distinct types of hcrt/orx cells. Here, we collated diverse structural and functional information about individual hcrt/orx neurons in mouse brain slices, by combining patch-clamp analysis of spike firing, membrane currents, and synaptic inputs with confocal imaging of cell shape and subsequent 3-dimensional Sholl analysis of dendritic architecture. Statistical cluster analysis of intrinsic firing properties revealed that hcrt/orx neurons fall into two distinct types. These two cell types also differ in the complexity of their dendritic arbour, the strength of AMPA and GABA-A receptor-mediated synaptic drive that they receive, and the density of low-threshold, 4-aminopyridine-sensitive, transient K⁺ current. Our results provide quantitative evidence that, at the cellular level, the mouse hcrt/orx system is composed of two classes of neurons with different firing properties, morphologies, and synaptic input organization.

#### The Complexity of Inference in Latent Dirichlet Allocation

David Sontag, Daniel M. Roy, 2011. (In Advances in Neural Information Processing Systems 24). Cambridge, MA, USA. The MIT Press.

Abstract▼ URL

We consider the computational complexity of probabilistic inference in Latent Dirichlet Allocation (LDA). First, we study the problem of finding the maximum a posteriori (MAP) assignment of topics to words, where the document’s topic distribution is integrated out. We show that, when the effective number of topics per document is small, exact inference takes polynomial time. In contrast, we show that, when a document has a large number of topics, finding the MAP assignment of topics to words in LDA is NP-hard. Next, we consider the problem of finding the MAP topic distribution for a document, where the topic-word assignments are integrated out. We show that this problem is also NP-hard. Finally, we briefly discuss the problem of sampling from the posterior, showing that this is NP-hard in one restricted setting, but leaving open the general question.

#### Gaussian Processes for State Space Models and Change Point Detection

Ryan Darby Turner, 2011. University of Cambridge, Department of Engineering, Cambridge, UK.

Abstract▼ URL

This thesis details several applications of Gaussian processes (GPs) for enhanced time series modeling. We first cover different approaches for using Gaussian processes in time series problems. These are extended to the state space approach to time series in two different problems. We also combine Gaussian processes and Bayesian online change point detection (BOCPD) to increase the generality of the Gaussian process time series methods. These methodologies are evaluated on predictive performance on six real world data sets, which include three environmental data sets, one financial, one biological, and one from industrial well drilling. Gaussian processes are capable of generalizing standard linear time series models. We cover two approaches: the Gaussian process time series model (GPTS) and the autoregressive Gaussian process (ARGP). We cover a variety of methods that greatly reduce the computational and memory complexity of Gaussian process approaches, which are generally cubic in computational complexity. Two different improvements to state space based approaches are covered. First, Gaussian process inference and learning (GPIL) generalizes linear dynamical systems (LDS), on which the Kalman filter is based, to general nonlinear systems for nonparametric system identification. Second, we address pathologies in the unscented Kalman filter (UKF). We use Gaussian process optimization (GPO) to learn UKF settings that minimize the potential for sigma point collapse. We show how to embed the aforementioned Gaussian process approaches to time series into a change point framework. Old data, from an old regime, that hinders predictive performance is automatically and elegantly phased out. The computational improvements for Gaussian process time series approaches are of even greater use in the change point framework. We also present a supervised framework for learning a change point model when change point labels are available in training. These methodologies significantly improve predictive performance on the diverse set of data sets selected.

#### Two problems with variational expectation maximisation for time-series models

Richard E. Turner, Maneesh Sahani, 2011. (In Bayesian Time series models). Edited by D. Barber, T. Cemgil, S. Chiappa. Cambridge University Press.

Abstract▼ URL

Variational methods are a key component of the approximate inference and learning toolbox. These methods fill an important middle ground, retaining distributional information about uncertainty in latent variables, unlike maximum a posteriori (MAP) methods, and yet generally requiring less computational time than Markov chain Monte Carlo methods. In particular the variational Expectation Maximisation (vEM) and variational Bayes algorithms, both involving variational optimisation of a free-energy, are widely used in time-series modelling. Here, we investigate the success of vEM in simple probabilistic time-series models. First we consider the inference step of vEM, and show that a consequence of the well-known compactness property of variational inference is a failure to propagate uncertainty in time, thus limiting the usefulness of the retained distributional information. In particular, the uncertainty may appear to be smallest precisely when the approximation is poorest. Second, we consider parameter learning and analytically reveal systematic biases in the parameters found by vEM. Surprisingly, simpler variational approximations (such as mean-field) can lead to less bias than more complicated structured approximations.

#### Demodulation as Probabilistic Inference

Richard E. Turner, Maneesh Sahani, 2011. (Transactions on Audio, Speech and Language Processing).

Abstract▼ URL

Demodulation is an ill-posed problem whenever both carrier and envelope signals are broadband and unknown. Here, we approach this problem using the methods of probabilistic inference. The new approach, called Probabilistic Amplitude Demodulation (PAD), is computationally challenging but improves on existing methods in a number of ways. By contrast to previous approaches to demodulation, it satisfies five key desiderata: PAD has soft constraints because it is probabilistic; PAD is able to automatically adjust to the signal because it learns parameters; PAD is user-steerable because the solution can be shaped by user-specific prior information; PAD is robust to broad-band noise because this is modelled explicitly; and PAD’s solution is self-consistent, empirically satisfying a Carrier Identity property. Furthermore, the probabilistic view naturally encompasses noise and uncertainty, allowing PAD to cope with missing data and return error bars on carrier and envelope estimates. Finally, we show that when PAD is applied to a bandpass-filtered signal, the stop-band energy of the inferred carrier is minimal, making PAD well-suited to sub-band demodulation.

#### Probabilistic amplitude and frequency demodulation

Richard E. Turner, Maneesh Sahani, 2011. (In Advances in Neural Information Processing Systems 24). The MIT Press.

Abstract▼ URL

A number of recent scientific and engineering problems require signals to be decomposed into a product of a slowly varying positive envelope and a quickly varying carrier whose instantaneous frequency also varies slowly over time. Although signal processing provides algorithms for so-called amplitude- and frequency-demodulation (AFD), there are well known problems with all of the existing methods. Motivated by the fact that AFD is ill-posed, we approach the problem using probabilistic inference. The new approach, called probabilistic amplitude and frequency demodulation (PAFD), models instantaneous frequency using an auto-regressive generalization of the von Mises distribution, and the envelopes using Gaussian auto-regressive dynamics with a positivity constraint. A novel form of expectation propagation is used for inference. We demonstrate that although PAFD is computationally demanding, it outperforms previous approaches on synthetic and real signals in clean, noisy and missing data settings.

#### Generalised Wishart Processes

Andrew Gordon Wilson, Zoubin Ghahramani, 2011. (In 27th Conference on Uncertainty in Artificial Intelligence).

Abstract▼ URL

We introduce a new stochastic process called the generalised Wishart process (GWP). It is a collection of positive semi-definite random matrices indexed by any arbitrary input variable. We use this process as a prior over dynamic (e.g. time varying) covariance matrices. The GWP captures a diverse class of covariance dynamics, naturally handles missing data, scales nicely with dimension, has easily interpretable parameters, and can use input variables that include covariates other than time. We describe how to construct the GWP, introduce general procedures for inference and prediction, and show that it outperforms its main competitor, multivariate GARCH, even on financial data that especially suits GARCH.

**Comment:** Supplementary Material, Best Student Paper Award

#### Gaussian Process Regression Networks

Andrew Gordon Wilson, David A Knowles, Zoubin Ghahramani, October 19 2011. Department of Engineering, University of Cambridge, Cambridge, UK.

Abstract▼ URL

We introduce a new regression framework, Gaussian process regression networks (GPRN), which combines the structural properties of Bayesian neural networks with the non-parametric flexibility of Gaussian processes. This model accommodates input dependent signal and noise correlations between multiple response variables, input dependent length-scales and amplitudes, and heavy-tailed predictive distributions. We derive both efficient Markov chain Monte Carlo and variational Bayes inference procedures for this model. We apply GPRN as a multiple output regression and multivariate volatility model, demonstrating substantially improved performance over eight popular multiple output (multi-task) Gaussian process models and three multivariate volatility models on benchmark datasets, including a 1000 dimensional gene expression dataset.

**Comment:** arXiv:1110.4411

#### An L1-regularized logistic model for detecting short-term neuronal interactions.

M. Zhao, A. P. Batista, J. P. Cunningham, C. A. Chestek, Z. Rivera-Alvidrez, R. Kalmar, S. I. Ryu, K. V. Shenoy, S. Iyengar, 2011. (Journal of Computational Neuroscience). **DOI**: 10.1007/s10827-011-0365-5. **Note**: In Press.

Abstract▼ URL

Interactions among neurons are a key component of neural signal processing. Rich neural data sets potentially containing evidence of interactions can now be collected readily in the laboratory, but existing analysis methods are often not sufficiently sensitive and specific to reveal these interactions. Generalized linear models offer a platform for analyzing multi-electrode recordings of neuronal spike train data. Here we suggest an L1-regularized logistic regression model (L1L method) to detect short-term (order of 3 ms) neuronal interactions. We estimate the parameters in this model using a coordinate descent algorithm, and determine the optimal tuning parameter using a Bayesian Information Criterion. Simulation studies show that in general the L1L method has better sensitivities and specificities than those of the traditional shuffle-corrected cross-correlogram (covariogram) method. The L1L method is able to detect excitatory interactions with both high sensitivity and specificity with reasonably large recordings, even when the magnitude of the interactions is small; similar results hold for inhibition given sufficiently high baseline firing rates. Our study also suggests that the false positives can be further removed by thresholding, because their magnitudes are typically smaller than true interactions. Simulations also show that the L1L method is somewhat robust to partially observed networks. We apply the method to multi-electrode recordings collected in the monkey dorsal premotor cortex (PMd) while the animal prepares to make reaching arm movements. The results show that some neurons interact differently depending on task conditions. The stronger interactions detected with our L1L method were also visible using the covariogram method.

## 2010

#### Tree-Structured Stick Breaking for Hierarchical Data

R. P. Adams, Zoubin Ghahramani, Michael I. Jordan, 2010. (In Advances in Neural Information Processing Systems 23). The MIT Press.

Abstract▼ URL

Many data are naturally modeled by an unobserved hierarchical structure. In this paper we propose a flexible nonparametric prior over unknown data hierarchies. The approach uses nested stick-breaking processes to allow for trees of unbounded width and depth, where data can live at any node and are infinitely exchangeable. One can view our model as providing infinite mixtures where the components have a dependency structure corresponding to an evolutionary diffusion down a tree. By using a stick-breaking approach, we can apply Markov chain Monte Carlo methods based on slice sampling to perform Bayesian inference and simulate from the posterior distribution on trees. We apply our method to hierarchical clustering of images and topic modeling of text data.

#### Learning the Structure of Deep Sparse Graphical Models

R. P. Adams, H. Wallach, Zoubin Ghahramani, May 2010. (In 13th International Conference on Artificial Intelligence and Statistics). Edited by Yee Whye Teh, Mike Titterington. Chia Laguna, Sardinia, Italy.

Abstract▼ URL

Deep belief networks are a powerful way to model complex probability distributions. However, it is difficult to learn the structure of a belief network, particularly one with hidden units. The Indian buffet process has been used as a nonparametric Bayesian prior on the structure of a directed belief network with a single infinitely wide hidden layer. Here, we introduce the cascading Indian buffet process (CIBP), which provides a prior on the structure of a layered, directed belief network that is unbounded in both depth and width, yet allows tractable inference. We use the CIBP prior with the nonlinear Gaussian belief network framework to allow each unit to vary its behavior between discrete and continuous representations. We use Markov chain Monte Carlo for inference in this model and explore the structures learned on image data.

**Comment:** Winner of the Best Paper Award

#### A minimum relative entropy principle for adaptive control in linear quadratic regulators

Daniel A. Braun, Pedro A. Ortega, 2010. (In Proceedings of the 7th international conference on informatics in control, automation and robotics).

Abstract▼

The design of optimal adaptive controllers is usually based on heuristics, because solving Bellman’s equations over information states is notoriously intractable. Approximate adaptive controllers often rely on the principle of certainty-equivalence where the control process deals with parameter point estimates as if they represented “true” parameter values. Here we present a stochastic control rule instead where controls are sampled from a posterior distribution over a set of probabilistic input-output models and the true model is identified by Bayesian inference. This allows reformulating the adaptive control problem as an inference and sampling problem derived from a minimum relative entropy principle. Importantly, inference and action sampling both work forward in time and hence such a Bayesian adaptive controller is applicable on-line. We demonstrate the improved performance that can be achieved by such an approach for linear quadratic regulator examples.

#### Scaling the iHMM: Parallelization versus Hadoop

Sébastien Bratières, Jurgen Van Gael, Andreas Vlachos, Zoubin Ghahramani, 2010. (In Proceedings of the 2010 10th IEEE International Conference on Computer and Information Technology). Bradford, UK. IEEE Computer Society. **DOI**: 10.1109/CIT.2010.223. **ISBN**: 978-0-7695-4108-2.

Abstract▼ URL

This paper compares parallel and distributed implementations of an iterative, Gibbs sampling, machine learning algorithm. Distributed implementations run under Hadoop on facility computing clouds. The probabilistic model under study is the infinite HMM (Beal, Ghahramani and Rasmussen, 2002), in which parameters are learnt using an instance of blocked Gibbs sampling, with a step consisting of a dynamic program. We apply this model to learn part-of-speech tags from newswire text in an unsupervised fashion. However, our focus here is on runtime performance, as opposed to NLP-relevant scores, embodied by iteration duration, ease of development, deployment and debugging.

#### Cortical preparatory activity: Representation of movement or first cog in a dynamical machine?

M. M. Churchland, J. P. Cunningham, M. T. Kaufman, S. I. Ryu, K. V. Shenoy, 2010. (Neuron).

Abstract▼ URL

The motor cortices are active during both movement and movement preparation. A common assumption is that preparatory activity constitutes a subthreshold form of movement activity: a neuron active during rightward movements becomes modestly active during preparation of a rightward movement. We asked whether this pattern of activity is, in fact, observed. We found that it was not: at the level of a single neuron, preparatory tuning was weakly correlated with movement-period tuning. Yet, somewhat paradoxically, preparatory tuning could be captured by a preferred direction in an abstract “space” that described the population-level pattern of movement activity. In fact, this relationship accounted for preparatory responses better than did traditional tuning models. These results are expected if preparatory activity provides the initial state of a dynamical system whose evolution produces movement activity. Our results thus suggest that preparatory activity may not represent specific factors, and may instead play a more mechanistic role.

#### Stimulus onset quenches neural variability: a widespread cortical phenomenon

M. M. Churchland, B. M. Yu, J. P. Cunningham, L. P. Sugrue, M. R. Cohen, G. S. Corrado, W. T. Newsome, A. M. Clark, P. Hosseini, B. B. Scott, D. C. Bradley, M. A. Smith, A. Kohn, J. A. Movshon, K. M. Armstrong, T. Moore, S. W. Chang, L. H. Snyder, S. G. Lisberger, N. J. Priebe, I. M. Finn, D. Ferster, S. I. Ryu, G. Santhanam, M. Sahani, K. V. Shenoy, 2010. (Nature Neuroscience).

Abstract▼ URL

Neural responses are typically characterized by computing the mean firing rate, but response variability can exist across trials. Many studies have examined the effect of a stimulus on the mean response, but few have examined the effect on response variability. We measured neural variability in 13 extracellularly recorded datasets and one intracellularly recorded dataset from seven areas spanning the four cortical lobes in monkeys and cats. In every case, stimulus onset caused a decline in neural variability. This occurred even when the stimulus produced little change in mean firing rate. The variability decline was observed in membrane potential recordings, in the spiking of individual neurons and in correlated spiking variability measured with implanted 96-electrode arrays. The variability decline was observed for all stimuli tested, regardless of whether the animal was awake, behaving or anaesthetized. This widespread variability decline suggests a rather general property of cortex, that its state is stabilized by an input.

#### Efficient Reinforcement Learning using Gaussian Processes

Marc Peter Deisenroth, 2010. Karlsruhe Institute of Technology, Karlsruhe, Germany.

Abstract▼ URL

In many research areas, including control and medical applications, we face decision-making problems where data are limited and/or the underlying generative process is complicated and partially unknown. In these scenarios, we can profit from algorithms that learn from data and aid decision making. Reinforcement learning (RL) is a general computational approach to experience-based goal-directed learning for sequential decision making under uncertainty. However, RL often lacks efficiency in terms of the number of required trials when no task-specific knowledge is available. This lack of efficiency makes RL often inapplicable to (optimal) control problems. Thus, a central issue in RL is to speed up learning by extracting more information from available experience. The contributions of this dissertation are threefold: 1. We propose PILCO, a fully Bayesian approach for efficient RL in continuous-valued state and action spaces when no expert knowledge is available. PILCO is based on well-established ideas from statistics and machine learning. PILCO’s key ingredient is a probabilistic dynamics model learned from data, which is implemented by a Gaussian process (GP). The GP carefully quantifies knowledge by a probability distribution over plausible dynamics models. By averaging over all these models during long-term planning and decision making, PILCO takes uncertainties into account in a principled way and, therefore, reduces model bias, a central problem in model-based RL. 2. Due to its generality and efficiency, PILCO can be considered a conceptual and practical approach to jointly learning models and controllers when expert knowledge is difficult to obtain or simply not available. For this scenario, we investigate PILCO’s properties and its applicability to challenging real and simulated nonlinear control problems. For example, we consider the tasks of learning to swing up a double pendulum attached to a cart or to balance a unicycle with five degrees of freedom. Across all tasks we report unprecedented automation and an unprecedented learning efficiency for solving these tasks. 3. As a step toward PILCO’s extension to partially observable Markov decision processes, we propose a principled algorithm for robust filtering and smoothing in GP dynamic systems. Unlike commonly used Gaussian filters for nonlinear systems, it relies neither on function linearization nor on finite-sample representations of densities. Our algorithm profits from exact moment matching for predictions while keeping all computations analytically tractable. We present experimental evidence that demonstrates the robustness and the advantages of our method over unscented Kalman filters, the cubature Kalman filter, and the extended Kalman filter.

#### No Correlation Between Childhood Maltreatment and Telomere Length.

Daniel Glass, Leopold Parts, David A. Knowles, Abraham Aviv, Tim D. Spector, 2010. (Biological Psychiatry).

Abstract▼

Telomeres are lengths of repetitive DNA that cap the ends of chromosomes. They protect the ends of the chromosome and shorten with each cell division. Short leukocyte telomere length has been related to a number of age-related diseases. In addition, shorter telomere length has been associated with environmental factors such as smoking and lack of exercise. In a recent issue of Biological Psychiatry, Tyrka et al. (4) published a report suggesting a link between maltreatment in childhood and telomere shortening in 31 subjects. Individuals who had suffered maltreatment had telomere length 0.70 ± 0.24 compared with 1.02 ± 0.52 in individuals who had not been abused.

#### Dirichlet Process Gaussian Mixture Models: Choice of the base distribution

Dilan Görür, Carl Edward Rasmussen, July 2010. (Journal of Computer Science and Technology). Beijing, China. Science Press. **DOI**: 10.1007/s11390-010-9355-8.

Abstract▼ URL

In the Bayesian mixture modeling framework it is possible to infer the necessary number of components to model the data and therefore it is unnecessary to explicitly restrict the number of components. Nonparametric mixture models sidestep the problem of finding the “correct” number of mixture components by assuming infinitely many components. In this paper Dirichlet process mixture (DPM) models are cast as infinite mixture models and inference using Markov chain Monte Carlo is described. The specification of the priors on the model parameters is often guided by mathematical and practical convenience. The primary goal of this paper is to compare the choice of conjugate and non-conjugate base distributions on a particular class of DPM models which is widely used in applications, the Dirichlet process Gaussian mixture model (DPGMM). We compare computational efficiency and modeling performance of DPGMM defined using a conjugate and a conditionally conjugate base distribution. We show that better density models can result from using a wider class of priors with no or only a modest increase in computational effort.

#### Variational inference for nonparametric multiple clustering

Y. Guan, J. G. Dy, D. Niu, Z. Ghahramani, July 2010. (In KDD10 Workshop on Discovering, Summarizing, and Using Multiple Clusterings). Washington, DC, USA.

Abstract▼ URL

Most clustering algorithms produce a single clustering solution. Similarly, feature selection for clustering tries to find one feature subset where one interesting clustering solution resides. However, a single data set may be multi-faceted and can be grouped and interpreted in many different ways, especially for high dimensional data, where feature selection is typically needed. Moreover, different clustering solutions are interesting for different purposes. Instead of committing to one clustering solution, in this paper we introduce a probabilistic nonparametric Bayesian model that can discover several possible clustering solutions and the feature subset views that generated each cluster partitioning simultaneously. We provide a variational inference approach to learn the features and clustering partitions in each view. Our model allows us not only to learn the multiple clusterings and views but also allows us to automatically learn the number of views and the number of clusters in each view.

#### Mind reading by machine learning: A doubly Bayesian method for inferring mental representations

Ferenc Huszár, Uta Noppeney, Máté Lengyel, August 2010. (In The Proceedings of the 32nd Annual Meeting of the Cognitive Science Society). Edited by S. Ohlsson, R. Catrambone. Austin, TX, USA. The Cognitive Science Society.

Abstract▼ URL

A central challenge in cognitive science is to measure and quantify the mental representations humans develop, in other words, to ‘read’ subjects’ minds. In order to eliminate potential biases in reporting mental contents due to verbal elaboration, subjects’ responses in experiments are often limited to binary decisions or discrete choices that do not require conscious reflection upon their mental contents. However, it is unclear what such impoverished data can tell us about the potential richness and dynamics of subjects’ mental representations. To address this problem, we used ideal observer models that formalise choice behaviour as (quasi-)Bayes-optimal, given subjects’ representations in long-term memory, acquired through prior learning, and the stimuli currently available to them. Bayesian inversion of such ideal observer models allowed us to infer subjects’ mental representation from their choice behaviour in a variety of psychophysical tasks. The inferred mental representations also allowed us to predict future choices of subjects with reasonable accuracy, even in tasks that were different from those in which the representations were estimated. These results demonstrate a significant potential in standard binary decision tasks to recover detailed information about subjects’ mental representations.

**Comment:** Supplementary material available here.

#### Bayesian Knowledge Corroboration with Logical Rules and User Feedback

G. Kasneci, J. Van Gael, T. Graepel, R. Herbrich, September 2010. (In European Conference on Machine Learning (ECML)). Barcelona, Spain.

Abstract▼ URL

Current knowledge bases suffer from either low coverage or low accuracy. The underlying hypothesis of this work is that user feedback can greatly improve the quality of automatically extracted knowledge bases. The feedback could help quantify the uncertainty associated with the stored statements and would enable mechanisms for searching, ranking and reasoning at entity-relationship level. Most importantly, a principled model for exploiting user feedback to learn the truth values of statements in the knowledge base would be a major step forward in addressing the issue of knowledge base curation. We present a family of probabilistic graphical models that builds on user feedback and logical inference rules derived from the popular Semantic-Web formalism of RDFS [1]. Through internal inference and belief propagation, these models can learn both the truth values of the statements in the knowledge base and the reliabilities of the users who give feedback. We demonstrate the viability of our approach in extensive experiments on real-world datasets, with feedback collected from Amazon Mechanical Turk.

#### Modeling skin and ageing phenotypes using latent variable models in Infer.NET

David A. Knowles, Leopold Parts, Daniel Glass, John M. Winn, 2010. (In NIPS Workshop: Predictive Models in Personalized Medicine Workshop).

Abstract▼ URL

We demonstrate and compare three unsupervised Bayesian latent variable models implemented in Infer.NET for biomedical data modeling of 42 skin and ageing phenotypes measured on the 12,000 female twins in the Twins UK study. We address various data modeling problems including high missingness, heterogeneous data, and repeat observations. We compare the proposed models in terms of their performance at predicting disease labels and symptoms from available explanatory variables, concluding that factor analysis type models have the strongest statistical performance in this setting. We show that such models can be combined with regression components for improved interpretability.

**Comment:** web site

#### Sparse Spectrum Gaussian Process Regression

Miguel Lázaro-Gredilla, Joaquin Quiñonero-Candela, Carl Edward Rasmussen, Aníbal Figueiras-Vidal, June 2010. (Journal of Machine Learning Research).

Abstract▼ URL

We present a new sparse Gaussian Process (GP) model for regression. The key novel idea is to sparsify the *spectral representation* of the GP. This leads to a simple, practical algorithm for regression tasks. We compare the achievable trade-offs between predictive accuracy and computational requirements, and show that these are typically superior to existing state-of-the-art sparse approximations. We discuss both the weight space and function space representations, and note that the new construction implies priors over functions which are always stationary, and can approximate any covariance function in this class.
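As a rough illustration of what sparsifying the spectral representation can look like, the sketch below approximates a stationary covariance with a finite set of trigonometric basis functions and then performs Bayesian linear regression on them. It assumes a squared-exponential covariance and keeps the spectral points fixed at random draws from its spectral density. Both choices, along with the names `trig_features` and `fit_predict` and all numeric values, are assumptions of this example and are not taken from the abstract.

```python
import numpy as np


def trig_features(x, omegas):
    """Map inputs of shape (n, d) to 2m trigonometric features of shape (n, 2m)."""
    proj = x @ omegas.T  # (n, m) projections onto the spectral points
    return np.hstack([np.cos(proj), np.sin(proj)])


def fit_predict(x_train, y_train, x_test, omegas, signal_var=1.0, noise_var=0.01):
    """Posterior predictive mean of Bayesian linear regression on the features.

    The weight prior is N(0, signal_var / m * I), so the implied covariance is
    (signal_var / m) * sum_r cos(omega_r . (x - x')), which is stationary.
    """
    m = omegas.shape[0]
    phi = trig_features(x_train, omegas)
    phi_star = trig_features(x_test, omegas)
    # Regularised normal equations for the posterior mean of the weights.
    a = phi.T @ phi + (m * noise_var / signal_var) * np.eye(2 * m)
    weights_mean = np.linalg.solve(a, phi.T @ y_train)
    return phi_star @ weights_mean


rng = np.random.default_rng(0)
lengthscale = 1.0  # illustrative value
# Angular frequencies drawn from the spectral density of a squared-exponential
# covariance with this lengthscale (an assumption of this sketch).
omegas = rng.normal(scale=1.0 / lengthscale, size=(50, 1))

x_train = np.linspace(-3.0, 3.0, 40)[:, None]
y_train = np.sin(x_train[:, 0])
x_test = np.array([[0.5]])
pred = fit_predict(x_train, y_train, x_test, omegas)
```

Training cost in this sketch is dominated by forming and solving a 2m × 2m system, so it grows linearly with the number of training points for a fixed number of spectral points m.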

#### Kronecker Graphs: An Approach to Modeling Networks

J. Leskovec, D. Chakrabarti, J. Kleinberg, C. Faloutsos, Z. Ghahramani, 2010. (Journal of Machine Learning Research).

Abstract▼ URL

How can we generate realistic networks? In addition, how can we do so with a mathematically tractable model that allows for rigorous analysis of network properties? Real networks exhibit a long list of surprising properties: Heavy tails for the in- and out-degree distribution, heavy tails for the eigenvalues and eigenvectors, small diameters, and densification and shrinking diameters over time. Current network models and generators either fail to match several of the above properties, are complicated to analyze mathematically, or both. Here we propose a generative model for networks that is both mathematically tractable and can generate networks that have all the above mentioned structural properties. Our main idea here is to use a non-standard matrix operation, the Kronecker product, to generate graphs which we refer to as “Kronecker graphs”. First, we show that Kronecker graphs naturally obey common network properties. In fact, we rigorously prove that they do so. We also provide empirical evidence showing that Kronecker graphs can effectively model the structure of real networks. We then present KRONFIT, a fast and scalable algorithm for fitting the Kronecker graph generation model to large real networks. A naive approach to fitting would take super-exponential time. In contrast, KRONFIT takes linear time, by exploiting the structure of Kronecker matrix multiplication and by using statistical simulation techniques. Experiments on a wide range of large real and synthetic networks show that KRONFIT finds accurate parameters that very well mimic the properties of target networks. In fact, using just four parameters we can accurately model several aspects of global network structure. Once fitted, the model parameters can be used to gain insights about the network structure, and the resulting synthetic graphs can be used for null-models, anonymization, extrapolations, and graph summarization.
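The generative idea can be sketched in a few lines: take a small initiator matrix of edge probabilities, raise it to a Kronecker power, and sample each edge independently. The initiator values, the function names, and the naive dense sampling below are assumptions made for illustration. This is not KRONFIT, and it scales quadratically in the number of nodes.

```python
import numpy as np


def kronecker_edge_probabilities(initiator, k):
    """Return the k-th Kronecker power of a square initiator matrix."""
    base = np.array(initiator, dtype=float)
    result = base.copy()
    for _ in range(k - 1):
        result = np.kron(result, base)
    return result


def sample_kronecker_graph(initiator, k, seed=0):
    """Sample a directed adjacency matrix with independent Bernoulli edges."""
    rng = np.random.default_rng(seed)
    probs = kronecker_edge_probabilities(initiator, k)
    return (rng.random(probs.shape) < probs).astype(int)


# Hypothetical 2x2 initiator (four parameters); the values are illustrative only.
initiator = [[0.9, 0.5], [0.5, 0.1]]
probs = kronecker_edge_probabilities(initiator, k=3)
adjacency = sample_kronecker_graph(initiator, k=3)
```

With a 2 × 2 initiator and k = 3 the result has 2³ = 8 nodes. Each entry of `probs` is a product of k initiator entries.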

#### Gene function prediction from synthetic lethality networks via ranking on demand

C. Lippert, Z. Ghahramani, K. Borgwardt, 2010. (Bioinformatics).

Abstract▼ URL

Motivation: Synthetic lethal interactions represent pairs of genes whose individual mutations are not lethal, while the double mutation of both genes does incur lethality. Several studies have shown a correlation between functional similarity of genes and their distances in networks based on synthetic lethal interactions. However, there is a lack of algorithms for predicting gene function from synthetic lethality interaction networks. Results: In this article, we present a novel technique called kernelROD for gene function prediction from synthetic lethal interaction networks based on kernel machines. We apply our novel algorithm to Gene Ontology functional annotation prediction in yeast. Our experiments show that our method leads to improved gene function prediction compared with state-of-the-art competitors and that combining genetic and congruence networks leads to a further improvement in prediction accuracy.

#### Gaussian Mixture Modeling with Gaussian Process Latent Variable Models

Hannes Nickisch, Carl Edward Rasmussen, September 2010. (In Proceedings of the 32nd DAGM Symposium on Pattern Recognition). Darmstadt, Germany. Springer. Lecture Notes in Computer Science (LNCS). **DOI**: 10.1007/978-3-642-15986-2_28.

Abstract▼ URL

Density modeling is notoriously difficult for high dimensional data. One approach to the problem is to search for a lower dimensional manifold which captures the main characteristics of the data. Recently, the Gaussian Process Latent Variable Model (GPLVM) has successfully been used to find low dimensional manifolds in a variety of complex data. The GPLVM consists of a set of points in a low dimensional latent space, and a stochastic map to the observed space. We show how it can be interpreted as a density model in the observed space. However, the GPLVM is not trained as a density model and therefore yields bad density estimates. We propose a new training strategy and obtain improved generalisation performance and better density estimates in comparative evaluations on several benchmark data sets.

#### Bayesian Nonparametric Models

Peter Orbanz, Yee-Whye Teh, 2010. (In Encyclopedia of Machine Learning). Springer.

#### A conversion between utility and information

Pedro A. Ortega, Daniel A. Braun, 2010. (In The third conference on artificial general intelligence). Paris. Atlantis Press.

Abstract▼ URL

Rewards typically express desirabilities or preferences over a set of alternatives. Here we propose that rewards can be defined for any probability distribution based on three desiderata, namely that rewards should be real-valued, additive and order-preserving, where the latter implies that more probable events should also be more desirable. Our main result states that rewards are then uniquely determined by the negative information content. To analyze stochastic processes, we define the utility of a realization as its reward rate. Under this interpretation, we show that the expected utility of a stochastic process is its negative entropy rate. Furthermore, we apply our results to analyze agent-environment interactions. We show that the expected utility that will actually be achieved by the agent is given by the negative cross-entropy from the input-output (I/O) distribution of the coupled interaction system and the agent’s I/O distribution. Thus, our results allow for an information-theoretic interpretation of the notion of utility and the characterization of agent-environment interactions in terms of entropy dynamics.

#### A Bayesian rule for adaptive control based on causal interventions

Pedro A. Ortega, Daniel A. Braun, 2010. (In The third conference on artificial general intelligence). Paris. Atlantis Press.

Abstract▼ URL

Explaining adaptive behavior is a central problem in artificial intelligence research. Here we formalize adaptive agents as mixture distributions over sequences of inputs and outputs (I/O). Each distribution of the mixture constitutes a “possible world”, but the agent does not know which of the possible worlds it is actually facing. The problem is to adapt the I/O stream in a way that is compatible with the true world. A natural measure of adaptation can be obtained by the Kullback-Leibler (KL) divergence between the I/O distribution of the true world and the I/O distribution expected by the agent that is uncertain about possible worlds. In the case of pure input streams, the Bayesian mixture provides a well-known solution for this problem. We show, however, that in the case of I/O streams this solution breaks down, because outputs are issued by the agent itself and require a different probabilistic syntax as provided by intervention calculus. Based on this calculus, we obtain a Bayesian control rule that allows modeling adaptive behavior with mixture distributions over I/O streams. This rule might allow for a novel approach to adaptive control based on a minimum KL-principle.

#### A minimum relative entropy principle for learning and acting

Pedro A. Ortega, Daniel A. Braun, 2010. (Journal of Artificial Intelligence Research). **DOI**: 10.1613/jair.3062.

Abstract▼ URL

This paper proposes a method to construct an adaptive agent that is universal with respect to a given class of experts, where each expert is designed specifically for a particular environment. This adaptive control problem is formalized as the problem of minimizing the relative entropy of the adaptive agent from the expert that is most suitable for the unknown environment. If the agent is a passive observer, then the optimal solution is the well-known Bayesian predictor. However, if the agent is active, then its past actions need to be treated as causal interventions on the I/O stream rather than normal probability conditions. Here it is shown that the solution to this new variational problem is given by a stochastic controller called the Bayesian control rule, which implements adaptive behavior as a mixture of experts. Furthermore, it is shown that under mild assumptions, the Bayesian control rule converges to the control law of the most suitable expert.

#### An axiomatic formalization of bounded rationality based on a utility-information equivalence

Pedro A. Ortega, Daniel A. Braun, 2010. Dept. of Engineering, University of Cambridge.

Abstract▼ URL

Classic decision-theory is based on the maximum expected utility (MEU) principle, but crucially ignores the resource costs incurred when determining optimal decisions. Here we propose an axiomatic framework for bounded decision-making that considers resource costs. Agents are formalized as probability measures over input-output streams. We postulate that any such probability measure can be assigned a corresponding conjugate utility function based on three axioms: utilities should be real-valued, additive and monotonic mappings of probabilities. We show that these axioms enforce a unique conversion law between utility and probability (and thereby, information). Moreover, we show that this relation can be characterized as a variational principle: given a utility function, its conjugate probability measure maximizes a free utility functional. Transformations of probability measures can then be formalized as a change in free utility due to the addition of new constraints expressed by a target utility function. Accordingly, one obtains a criterion to choose a probability measure that trades off the maximization of a target utility function and the cost of the deviation from a reference distribution. We show that optimal control, adaptive estimation and adaptive control problems can be solved this way in a resource-efficient way. When resource costs are ignored, the MEU principle is recovered. Our formalization might thus provide a principled approach to bounded rationality that establishes a close link to information theory.

#### Gaussian Processes for Machine Learning (GPML) Toolbox

Carl Edward Rasmussen, Hannes Nickisch, December 2010. (Journal of Machine Learning Research).

Abstract▼ URL

The GPML toolbox provides a wide range of functionality for Gaussian process (GP) inference and prediction. GPs are specified by mean and covariance functions; we offer a library of simple mean and covariance functions and mechanisms to compose more complex ones. Several likelihood functions are supported including Gaussian and heavy-tailed for regression as well as others suitable for classification. Finally, a range of inference methods is provided, including exact and variational inference, Expectation Propagation, and Laplace’s method for dealing with non-Gaussian likelihoods, and FITC for dealing with large regression tasks.

**Comment:** Toolbox available from here. Implements algorithms from Rasmussen and Williams, 2006.

#### Traffic Classification in Information Poor Environments

C. Rotsos, J. Van Gael, A.W. Moore, Z. Ghahramani, July 2010. (In 1st International Workshop on Traffic Analysis and Classification (IWCMC '10)). Caen, France.

Abstract▼ URL

Traffic classification using machine learning continues to be an active research area. The majority of work in this area uses *off-the-shelf* machine learning tools and treats them as *black-box* classifiers. This approach turns all the modelling complexity into a feature selection problem. In this paper, we build a problem-specific solution to the traffic classification problem by designing a custom probabilistic graphical model. Graphical models are a modular framework to design classifiers which incorporate domain-specific knowledge. More specifically, our solution introduces semi-supervised learning which means we learn from both labelled and unlabelled traffic flows. We show that our solution performs competitively compared to previous approaches while using less data and simpler features.

#### Probabilistic Graphical Models for Semi-Supervised Traffic Classification

Charalampos Rotsos, Jurgen Van Gael, Andrew W. Moore, Zoubin Ghahramani, 2010. (In The 6th International Wireless Communications and Mobile Computing Conference). Caen, France.

Abstract▼ URL

Traffic classification using machine learning continues to be an active research area. The majority of work in this area uses off-the-shelf machine learning tools and treats them as black-box classifiers. This approach turns all the modelling complexity into a feature selection problem. In this paper, we build a problem-specific solution to the traffic classification problem by designing a custom probabilistic graphical model. Graphical models are a modular framework to design classifiers which incorporate domain-specific knowledge. More specifically, our solution introduces semi-supervised learning which means we learn from both labelled and unlabelled traffic flows. We show that our solution performs competitively compared to previous approaches while using less data and simpler features.

#### Gaussian Process Change Point Models

Yunus Saatçi, Ryan Turner, Carl Edward Rasmussen, June 2010. (In 27th International Conference on Machine Learning). Haifa, Israel.

Abstract▼ URL

We combine Bayesian online change point detection with Gaussian processes to create a nonparametric time series model which can handle change points. The model can be used to locate change points in an online manner; and, unlike other Bayesian online change point detection algorithms, is applicable when temporal correlations in a regime are expected. We show three variations on how to apply Gaussian processes in the change point context, each with their own advantages. We present methods to reduce the computational burden of these models and demonstrate it on several real world data sets.

#### Discovering Transcriptional Modules by Bayesian Data Integration

R. S. Savage, Z. Ghahramani, J. E. Griffin, B. de la Cruz, D. L. Wild, 2010. (Bioinformatics).

Abstract▼ URL

Motivation: We present a method for directly inferring transcriptional modules (TMs) by integrating gene expression and transcription factor binding (ChIP-chip) data. Our model extends a hierarchical Dirichlet process mixture model to allow data fusion on a gene-by-gene basis. This encodes the intuition that co-expression and co-regulation are not necessarily equivalent and hence we do not expect all genes to group similarly in both datasets. In particular, it allows us to identify the subset of genes that share the same structure of transcriptional modules in both datasets. Results: We find that by working on a gene-by-gene basis, our model is able to extract clusters with greater functional coherence than existing methods. By combining gene expression and transcription factor binding (ChIP-chip) data in this way, we are better able to determine the groups of genes that are most likely to represent underlying TMs. Availability: If interested in the code for the work presented in this article, please contact the authors.

#### Ranking Relations Using Analogies in Biological and Information Networks

R. Silva, K. A. Heller, Z. Ghahramani, E. M. Airoldi, 2010. (Annals of Applied Statistics).

Abstract▼ URL

Analogical reasoning depends fundamentally on the ability to learn and generalize about relations between objects. We develop an approach to relational learning which, given a set of pairs of objects S = A(1):B(1), A(2):B(2), …, A(N):B(N), measures how well other pairs A:B fit in with the set S. Our work addresses the question: is the relation between objects A and B analogous to those relations found in S? Such questions are particularly relevant in information retrieval, where an investigator might want to search for analogous pairs of objects that match the query set of interest. There are many ways in which objects can be related, making the task of measuring analogies very challenging. Our approach combines a similarity measure on function spaces with Bayesian analysis to produce a ranking. It requires data containing features of the objects of interest and a link matrix specifying which relationships exist; no further attributes of such relationships are necessary. We illustrate the potential of our method on text analysis and information networks. An application on discovering functional interactions between pairs of proteins is discussed in detail, where we show that our approach can work in practice even if a small set of protein pairs is provided.

#### A robust Bayesian two-sample test for detecting intervals of differential gene expression in microarray time series

O. Stegle, K. J. Denby, E. J. Cooke, D. L. Wild, Z. Ghahramani, K. M. Borgwardt, 2010. (Journal of Computational Biology). **DOI**: 10.1089/cmb.2009.0175.

Abstract▼ URL

Understanding the regulatory mechanisms that are responsible for an organism’s response to environmental change is an important issue in molecular biology. A first and important step towards this goal is to detect genes whose expression levels are affected by altered external conditions. A range of methods to test for differential gene expression, both in static as well as in time-course experiments, have been proposed. While these tests answer the question *whether* a gene is differentially expressed, they do not explicitly address the question *when* a gene is differentially expressed, although this information may provide insights into the course and causal structure of regulatory programs. In this article, we propose a two-sample test for identifying intervals of differential gene expression in microarray time series. Our approach is based on Gaussian process regression, can deal with arbitrary numbers of replicates, and is robust with respect to outliers. We apply our algorithm to study the response of *Arabidopsis thaliana* genes to an infection by a fungal pathogen using a microarray time series dataset covering 30,336 gene probes at 24 observed time points. In classification experiments, our test compares favorably with existing methods and provides additional insights into time-dependent differential expression.

#### Statistical Models for Natural Sounds

Richard E. Turner, 2010. Gatsby Computational Neuroscience Unit, UCL.

Abstract▼ URL

It is important to understand the rich structure of natural sounds in order to solve important tasks, like automatic speech recognition, and to understand auditory processing in the brain. This thesis takes a step in this direction by characterising the statistics of simple natural sounds. We focus on the statistics because perception often appears to depend on them, rather than on the raw waveform. For example the perception of auditory textures, like running water, wind, fire and rain, depends on summary-statistics, like the rate of falling rain droplets, rather than on the exact details of the physical source. In order to analyse the statistics of sounds accurately it is necessary to improve a number of traditional signal processing methods, including those for amplitude demodulation, time-frequency analysis, and sub-band demodulation. These estimation tasks are ill-posed and therefore it is natural to treat them as Bayesian inference problems. The new probabilistic versions of these methods have several advantages. For example, they perform more accurately on natural signals and are more robust to noise, they can also fill-in missing sections of data, and provide error-bars. Furthermore, free-parameters can be learned from the signal. Using these new algorithms we demonstrate that the energy, sparsity, modulation depth and modulation time-scale in each sub-band of a signal are critical statistics, together with the dependencies between the sub-band modulators. In order to validate this claim, a model containing co-modulated coloured noise carriers is shown to be capable of generating a range of realistic sounding auditory textures. Finally, we explored the connection between the statistics of natural sounds and perception. We demonstrate that inference in the model for auditory textures qualitatively replicates the primitive grouping rules that listeners use to understand simple acoustic scenes. This suggests that the auditory system is optimised for the statistics of natural sounds.

#### Fast Online Anomaly Detection Using Scan Statistics

Ryan Turner, Steven Bottone, Zoubin Ghahramani, August 2010. (In Machine Learning for Signal Processing (MLSP 2010)). Edited by Samuel Kaski, David J. Miller, Erkki Oja, Antti Honkela. Kittilä, Finland. **ISBN**: 978-1-4244-7876-7.

Abstract▼ URL

We present methods to do fast online anomaly detection using scan statistics. Scan statistics have long been used to detect statistically significant bursts of events. We extend the scan statistics framework to handle many practical issues that occur in application: dealing with an unknown background rate of events, allowing for slow natural changes in background frequency, the inverse problem of finding an unusual lack of events, and setting the test parameters to maximize power. We demonstrate its use on real and synthetic data sets with comparison to other methods.

#### State-Space Inference and Learning with Gaussian Processes

Ryan Turner, Marc Peter Deisenroth, Carl Edward Rasmussen, May 13–15 2010. (In 13th International Conference on Artificial Intelligence and Statistics). Edited by Yee Whye Teh, Mike Titterington. Chia Laguna, Sardinia, Italy. W & CP.

Abstract▼ URL

State-space inference and learning with Gaussian processes (GPs) is an unsolved problem. We propose a new, general methodology for inference and learning in nonlinear state-space models that are described probabilistically by non-parametric GP models. We apply the expectation maximization algorithm to iterate between inference in the latent state-space and learning the parameters of the underlying GP dynamics model.

**Comment:** poster.

#### Model Based Learning of Sigma Points in Unscented Kalman Filtering

Ryan Turner, Carl Edward Rasmussen, August 2010. (In Machine Learning for Signal Processing (MLSP 2010)). Edited by Samuel Kaski, David J. Miller, Erkki Oja, Antti Honkela. Kittilä, Finland. **ISBN**: 978-1-4244-7876-7.

Abstract▼ URL

The unscented Kalman filter (UKF) is a widely used method in control and time series applications. The UKF suffers from arbitrary parameters necessary for a step known as sigma point placement, causing it to perform poorly in nonlinear problems. We show how to treat sigma point placement in a UKF as a learning problem in a model based view. We demonstrate that learning to place the sigma points correctly from data can make sigma point collapse much less likely. Learning can result in a significant increase in predictive performance over default settings of the parameters in the UKF and other filters designed to avoid the problems of the UKF, such as the GP-ADF. At the same time, we maintain a lower computational complexity than the other methods. We call our method UKF-L.

#### Statistical inference for single- and multi-band probabilistic amplitude demodulation

Richard E. Turner, Maneesh Sahani, 2010. (In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)).

Abstract▼ URL

Amplitude demodulation is an ill-posed problem and so it is natural to treat it from a Bayesian viewpoint, inferring the most likely carrier and envelope under probabilistic constraints. One such treatment is Probabilistic Amplitude Demodulation (PAD), which, whilst computationally more intensive than traditional approaches, offers several advantages. Here we provide methods for estimating the uncertainty in the PAD-derived envelopes and carriers, and for learning free-parameters like the time-scale of the envelope. We show how the probabilistic approach can naturally handle noisy and missing data. Finally, we indicate how to extend the model to signals which contain multiple modulators and carriers.

#### Active Learning for Constrained Dirichlet Process Mixture Models

Andreas Vlachos, Zoubin Ghahramani, Ted Briscoe, 2010. (In Proceedings of the 2010 Workshop on Geometrical Models of Natural Language Semantics). Uppsala, Sweden.

Abstract▼ URL

Recent work applied Dirichlet Process Mixture Models to the task of verb clustering, incorporating supervision in the form of must-links and cannot-links constraints between instances. In this work, we introduce an active learning approach for constraint selection employing uncertainty-based sampling. We achieve substantial improvements over random selection on two datasets.

#### Copula Processes

Andrew Gordon Wilson, Zoubin Ghahramani, 2010. (In Advances in Neural Information Processing Systems 23). **Note**: Spotlight.

Abstract▼ URL

We define a copula process which describes the dependencies between arbitrarily many random variables independently of their marginal distributions. As an example, we develop a stochastic volatility model, Gaussian Copula Process Volatility (GCPV), to predict the latent standard deviations of a sequence of random variables. To make predictions we use Bayesian inference, with the Laplace approximation, and with Markov chain Monte Carlo as an alternative. We find our model can outperform GARCH on simulated and financial data. And unlike GARCH, GCPV can easily handle missing data, incorporate covariates other than time, and model a rich class of covariance structures.

**Comment:** Supplementary Material, slides.

#### Dependent Indian buffet processes

Sinead Williamson, Peter Orbanz, Zoubin Ghahramani, May 2010. (In 13th International Conference on Artificial Intelligence and Statistics). Chia Laguna, Sardinia, Italy. W & CP.

Abstract▼ URL

Latent variable models represent hidden structure in observational data. To account for the distribution of the observational data changing over time, space or some other covariate, we need generalizations of latent variable models that explicitly capture this dependency on the covariate. A variety of such generalizations has been proposed for latent variable models based on the Dirichlet process. We address dependency on covariates in binary latent feature models, by introducing a dependent Indian Buffet Process. The model generates a binary random matrix with an unbounded number of columns for each value of the covariate. Evolution of the binary matrices over the covariate set is controlled by a hierarchical Gaussian process model. The choice of covariance functions controls the dependence structure and exchangeability properties of the model. We derive a Markov Chain Monte Carlo sampling algorithm for Bayesian inference, and provide experiments on both synthetic and real-world data. The experimental results show that explicit modeling of dependencies significantly improves accuracy of predictions.

#### The IBP compound Dirichlet process and its application to focused topic modeling

Sinead Williamson, Katherine A. Heller, C. Wang, D. M. Blei, June 2010. (In 27th International Conference on Machine Learning). Haifa, Israel.

Abstract▼ URL

The hierarchical Dirichlet process (HDP) is a Bayesian nonparametric mixed membership model: each data point is modeled with a collection of components of different proportions. Though powerful, the HDP makes an assumption that the probability of a component being exhibited by a data point is positively correlated with its proportion within that data point. This might be an undesirable assumption. For example, in topic modeling, a topic (component) might be rare throughout the corpus but dominant within those documents (data points) where it occurs. We develop the IBP compound Dirichlet process (ICD), a Bayesian nonparametric prior that decouples across-data prevalence and within-data proportion in a mixed membership model. The ICD combines properties from the HDP and the Indian buffet process (IBP), a Bayesian nonparametric prior on binary matrices. The ICD assigns a subset of the shared mixture components to each data point. This subset, the data point’s “focus”, is determined independently from the amount that each of its components contributes. We develop an ICD mixture model for text, the focused topic model (FTM), and show superior performance over the HDP-based topic model.

## 2009

#### Archipelago: nonparametric Bayesian semi-supervised learning

R. Adams, Zoubin Ghahramani, June 2009. (In 26th International Conference on Machine Learning). Edited by Léon Bottou, Michael Littman. Montréal, QC, Canada. Omnipress.

Abstract▼ URL

Semi-supervised learning (SSL) is classification where additional unlabeled data can be used to improve accuracy. Generative approaches are appealing in this situation, as a model of the data’s probability density can assist in identifying clusters. Nonparametric Bayesian methods, while ideal in theory due to their principled motivations, have been difficult to apply to SSL in practice. We present a nonparametric Bayesian method that uses Gaussian processes for the generative model, avoiding many of the problems associated with Dirichlet process mixture models. Our model is fully generative and we take advantage of recent advances in Markov chain Monte Carlo algorithms to provide a practical inference method. Our method compares favorably to competing approaches on synthetic and real-world multi-class data.

**Comment:** This paper was awarded Honourable Mention for Best Paper at ICML 2009.

#### Bayesian nonnegative matrix factorization with volume prior for unmixing of hyperspectral images

Morten Arngren, Mikkel N. Schmidt, Jan Larsen, September 2009. (In Machine Learning for Signal Processing, IEEE Workshop on (MLSP)). Grenoble, France. **DOI**: 10.1109/MLSP.2009.5306262. **ISBN**: 978-1-4244-4947-7.

Abstract▼ URL

In hyperspectral image analysis the objective is to unmix a set of acquired pixels into pure spectral signatures (endmembers) and corresponding fractional abundances. The Non-negative Matrix Factorization (NMF) methods have received a lot of attention for this unmixing process. Many of these NMF based unmixing algorithms are based on sparsity regularization encouraging pure spectral endmembers, but this is not optimal for certain applications, such as foods, where abundances are not sparse. The pixels will theoretically lie on a simplex and hence the endmembers can be estimated as the vertices of the smallest enclosing simplex. In this context we present a Bayesian framework employing a volume constraint for the NMF algorithm, where the posterior distribution is numerically sampled from using a Gibbs sampling procedure. We evaluate the method on synthetic and real hyperspectral data of wheat kernels.

**Comment:** This paper was “rated among the best papers submitted” to the 2009 Machine Learning for Signal Processing conference.

#### A Structured Model of Video Reproduces Primary Visual Cortical Organisation

Pietro Berkes, Richard E. Turner, Maneesh Sahani, September 2009. (PLoS Computational Biology). Public Library of Science. **DOI**: 10.1371/journal.pcbi.1000495.

Abstract▼ URL

The visual system must learn to infer the presence of objects and features in the world from the images it encounters, and as such it must, either implicitly or explicitly, model the way these elements interact to create the image. Do the response properties of cells in the mammalian visual system reflect this constraint? To address this question, we constructed a probabilistic model in which the identity and attributes of simple visual elements were represented explicitly and learnt the parameters of this model from unparsed, natural video sequences. After learning, the behaviour and grouping of variables in the probabilistic model corresponded closely to functional and anatomical properties of simple and complex cells in the primary visual cortex (V1). In particular, feature identity variables were activated in a way that resembled the activity of complex cells, while feature attribute variables responded much like simple cells. Furthermore, the grouping of the attributes within the model closely parallelled the reported anatomical grouping of simple cells in cat V1. Thus, this generative model makes explicit an interpretation of complex and simple cells as elements in the segmentation of a visual scene into basic independent features, along with a parametrisation of their moment-by-moment appearances. We speculate that such a segmentation may form the initial stage of a hierarchical system that progressively separates the identity and appearance of more articulated visual elements, culminating in view-invariant object recognition.

#### Bayesian two-sample tests

Karsten M. Borgwardt, Zoubin Ghahramani, 2009. (arXiv).

Abstract▼ URL

In this paper, we present two classes of Bayesian approaches to the two-sample problem. Our first class of methods extends the Bayesian t-test to include all parametric models in the exponential family and their conjugate priors. Our second class of methods uses Dirichlet process mixtures (DPM) of such conjugate-exponential distributions as flexible nonparametric priors over the unknown distributions.

#### Nash equilibria in multi-agent motor interactions

Daniel A. Braun, Pedro A. Ortega, Daniel M. Wolpert, 2009. (PLoS Computational Biology).

Abstract▼ URL

Social interactions in classic cognitive games like the ultimatum game or the prisoner’s dilemma typically lead to Nash equilibria when multiple competitive decision makers with perfect knowledge select optimal strategies. However, in evolutionary game theory it has been shown that Nash equilibria can also arise as attractors in dynamical systems that can describe, for example, the population dynamics of microorganisms. Similar to such evolutionary dynamics, we find that Nash equilibria arise naturally in motor interactions in which players vie for control and try to minimize effort. When confronted with sensorimotor interaction tasks that correspond to the classical prisoner’s dilemma and the rope-pulling game, two-player motor interactions led predominantly to Nash solutions. In contrast, when a single player took both roles, playing the sensorimotor game bimanually, cooperative solutions were found. Our methodology opens up a new avenue for the study of human motor interactions within a game theoretic framework, suggesting that the coupling of motor systems can lead to game theoretic solutions.

#### Influence of heart rate on the BOLD signal: the cardiac response function

C. Chang, J. P. Cunningham, G. Glover, 2009. (NeuroImage).

Abstract▼ URL

It has previously been shown that low-frequency fluctuations in both respiratory volume and cardiac rate can induce changes in the blood-oxygen level dependent (BOLD) signal. Such physiological noise can obscure the detection of neural activation using fMRI, and it is therefore important to model and remove the effects of this noise. While a hemodynamic response function relating respiratory variation (RV) and the BOLD signal has been described, no such mapping for heart rate (HR) has been proposed. In the current study, the effects of RV and HR are simultaneously deconvolved from resting state fMRI. It is demonstrated that a convolution model including RV and HR can explain significantly more variance in gray matter BOLD signal than a model that includes RV alone, and an average HR response function is proposed that well characterizes our subject population. It is observed that the voxel-wise morphology of the deconvolved RV responses is preserved when HR is included in the model, and that its form is adequately modeled by Birn et al.’s previously described respiration response function. Furthermore, it is shown that modeling out RV and HR can significantly alter functional connectivity maps of the default-mode network.

#### Probabilistic models for incomplete multi-dimensional arrays

W. Chu, Z. Ghahramani, April 2009. (In 12th International Conference on Artificial Intelligence and Statistics). Edited by D. van Dyk, M. Welling. Clearwater Beach, FL, USA. Microtome Publishing (paper) Journal of Machine Learning Research. **Note**: ISSN: 1938-7228.

Abstract▼ URL

In multiway data, each sample is measured by multiple sets of correlated attributes. We develop a probabilistic framework for modeling structural dependency from partially observed multi-dimensional array data, known as pTucker. Latent components associated with individual array dimensions are jointly retrieved while the core tensor is integrated out. The resulting algorithm is capable of handling large-scale data sets. We verify the usefulness of this approach by comparing against classical models on applications to modeling amino acid fluorescence, collaborative filtering and a number of benchmark multiway array data.

#### Analytic Moment-based Gaussian Process Filtering

Marc Peter Deisenroth, Marco F. Huber, Uwe D. Hanebeck, June 2009. (In 26th International Conference on Machine Learning). Edited by Léon Bottou, Michael Littman. Montréal, QC, Canada. Omnipress.

Abstract▼ URL

We propose an analytic moment-based filter for nonlinear stochastic dynamic systems modeled by Gaussian processes. Exact expressions for the expected value and the covariance matrix are provided for both the prediction step and the filter step, where an additional Gaussian assumption is exploited in the latter case. Our filter does not require further approximations. In particular, it avoids finite-sample approximations. We compare the filter to a variety of Gaussian filters, that is, the EKF, the UKF, and the recent GP-UKF proposed by Ko et al. (2007).

**Comment:** With corrections. code.

#### Bayesian Inference for Efficient Learning in Control

Marc Peter Deisenroth, Carl Edward Rasmussen, June 2009. (In Multidisciplinary Symposium on Reinforcement Learning). Montréal, QC, Canada.

Abstract▼ URL

In contrast to humans or animals, artificial learners often require more trials when learning motor control tasks solely based on experience. Efficient autonomous learners will reduce the amount of engineering required to solve control problems. By using probabilistic forward models, we can employ two key ingredients of biological learning systems to speed up artificial learning. We present a consistent and coherent Bayesian framework that allows for efficient autonomous experience-based learning. We demonstrate the success of our learning algorithm by applying it to challenging nonlinear control problems in simulation and in hardware.

#### Efficient Reinforcement Learning for Motor Control

Marc Peter Deisenroth, Carl Edward Rasmussen, September 2009. (In 10th International PhD Workshop on Systems and Control). Hluboká nad Vltavou, Czech Republic.

Abstract▼ URL

Artificial learners often require many more trials than humans or animals when learning motor control tasks in the absence of expert knowledge. We implement two key ingredients of biological learning systems, generalization and incorporation of uncertainty into the decision-making process, to speed up artificial learning. We present a coherent and fully Bayesian framework that allows for efficient artificial learning in the absence of expert knowledge. The success of our learning framework is demonstrated on challenging nonlinear control problems in simulation and in hardware.

#### Gaussian process dynamic programming

Marc Peter Deisenroth, Carl Edward Rasmussen, Jan Peters, March 2009. (Neurocomputing). Elsevier B. V.. **DOI**: 10.1016/j.neucom.2008.12.019.

Abstract▼ URL

Reinforcement learning (RL) and optimal control of systems with continuous states and actions require approximation techniques in most interesting cases. In this article, we introduce Gaussian process dynamic programming (GPDP), an approximate value function-based RL algorithm. We consider both a classic optimal control problem, where problem-specific prior knowledge is available, and a classic RL problem, where only very general priors can be used. For the classic optimal control problem, GPDP models the unknown value functions with Gaussian processes and generalizes dynamic programming to continuous-valued states and actions. For the RL problem, GPDP starts from a given initial state and explores the state space using Bayesian active learning. To design a fast learner, available data have to be used efficiently. Hence, we propose to learn probabilistic models of the a priori unknown transition dynamics and the value functions on the fly. In both cases, we successfully apply the resulting continuous-valued controllers to the under-actuated pendulum swing up and analyze the performances of the suggested algorithms. It turns out that GPDP uses data very efficiently and can be applied to problems, where classic dynamic programming would be cumbersome.

**Comment:** code.

#### The Indian Buffet Process: Scalable Inference and Extensions

Finale Doshi-Velez, August 2009. University of Cambridge, Cambridge, UK.

Abstract▼ URL

Many unsupervised learning problems seek to identify hidden features from observations. In many real-world situations, the number of hidden features is unknown. To avoid specifying the number of hidden features a priori, one can use the Indian Buffet Process (IBP): a nonparametric latent feature model that does not bound the number of active features in a dataset. While elegant, the lack of efficient inference procedures for the IBP has prevented its application in large-scale problems. The core contributions of this thesis are three new inference procedures that allow inference in the IBP to be scaled from a few hundred to 100,000 observations. This thesis contains three parts: (1) An introduction to the IBP and a review of inference techniques and extensions. The first chapters summarise three constructions for the IBP and review all currently published inference techniques. Appendix C reviews extensions of the IBP to date. (2) Novel techniques for scalable Bayesian inference. This thesis presents three new inference procedures: (a) an accelerated Gibbs sampler for efficient Bayesian inference in a broad class of conjugate models, (b) a parallel, asynchronous Gibbs sampler that allows the accelerated Gibbs sampler to be distributed across multiple processors, and (c) a variational inference procedure for the IBP. (3) A framework for structured nonparametric latent feature models. We also present extensions to the IBP to model more sophisticated relationships between the co-occurring hidden features, providing a general framework for correlated non-parametric feature models.

#### The Infinite Partially Observable Markov Decision Process

Finale Doshi-Velez, December 2009. (In Advances in Neural Information Processing Systems 22). Cambridge, MA, USA. The MIT Press.

Abstract▼ URL

The Partially Observable Markov Decision Process (POMDP) framework has proven useful in planning domains where agents must balance actions that provide knowledge and actions that provide reward. Unfortunately, most POMDPs are complex structures with a large number of parameters. In many real-world problems, both the structure and the parameters are difficult to specify from domain knowledge alone. Recent work in Bayesian reinforcement learning has made headway in learning POMDP models; however, this work has largely focused on learning the parameters of the POMDP model. We define an infinite POMDP (iPOMDP) model that does not require knowledge of the size of the state space; instead, it assumes that the number of visited states will grow as the agent explores its world and only models visited states explicitly. We demonstrate the iPOMDP on several standard problems.

#### Accelerated Gibbs sampling for the Indian buffet process

Finale Doshi-Velez, Zoubin Ghahramani, June 2009. (In 26th International Conference on Machine Learning). Edited by Léon Bottou, Michael Littman. Montréal, QC, Canada. Omnipress.

Abstract▼ URL

We often seek to identify co-occurring hidden features in a set of observations. The Indian Buffet Process (IBP) provides a non-parametric prior on the features present in each observation, but current inference techniques for the IBP often scale poorly. The collapsed Gibbs sampler for the IBP has a running time cubic in the number of observations, and the uncollapsed Gibbs sampler, while linear, is often slow to mix. We present a new linear-time collapsed Gibbs sampler for conjugate likelihood models and demonstrate its efficacy on large real-world datasets.

#### Accelerated sampling for the Indian Buffet Process

Finale Doshi-Velez, Zoubin Ghahramani, 2009. (In ICML). Edited by Andrea Pohoreckyj Danyluk, Léon Bottou, Michael L. Littman. ACM. ACM International Conference Proceeding Series. **ISBN**: 978-1-60558-516-1.

Abstract▼ URL

We often seek to identify co-occurring hidden features in a set of observations. The Indian Buffet Process (IBP) provides a nonparametric prior on the features present in each observation, but current inference techniques for the IBP often scale poorly. The collapsed Gibbs sampler for the IBP has a running time cubic in the number of observations, and the uncollapsed Gibbs sampler, while linear, is often slow to mix. We present a new linear-time collapsed Gibbs sampler for conjugate likelihood models and demonstrate its efficacy on large real-world datasets.

#### Large Scale Non-parametric Inference: Data Parallelisation in the Indian Buffet Process

Finale Doshi-Velez, David Knowles, Shakir Mohamed, Zoubin Ghahramani, December 2009. (In Advances in Neural Information Processing Systems 22). Cambridge, MA, USA. The MIT Press.

Abstract▼ URL

Nonparametric Bayesian models provide a framework for flexible probabilistic modelling of complex datasets. Unfortunately, the high-dimensional averages required for Bayesian methods can be slow, especially with the unbounded representations used by nonparametric models. We address the challenge of scaling Bayesian inference to the increasingly large datasets found in real-world applications. We focus on parallelisation of inference in the Indian Buffet Process (IBP), which allows data points to have an unbounded number of sparse latent features. Our novel MCMC sampler divides a large data set between multiple processors and uses message passing to compute the global likelihoods and posteriors. This algorithm, the first parallel inference scheme for IBP-based models, scales to datasets orders of magnitude larger than have previously been possible.

#### Variational inference for the Indian buffet process

F. Doshi-Velez, K.T. Miller, J. Van Gael, Y.W. Teh, April 2009. (In 12th International Conference on Artificial Intelligence and Statistics). Clearwater Beach, FL, USA. Journal of Machine Learning Research.

Abstract▼ URL

The Indian Buffet Process (IBP) is a nonparametric prior for latent feature models in which observations are influenced by a combination of hidden features. For example, images may be composed of several objects and sounds may consist of several notes. Latent feature models seek to infer these unobserved features from a set of observations; the IBP provides a principled prior in situations where the number of hidden features is unknown. Current inference methods for the IBP have all relied on sampling. While these methods are guaranteed to be accurate in the limit, samplers for the IBP tend to mix slowly in practice. We develop a deterministic variational method for inference in the IBP based on a truncated stick-breaking approximation, provide theoretical bounds on the truncation error, and evaluate our method in several data regimes.

#### Variational Inference for the Indian Buffet Process

Finale Doshi-Velez, Kurt T. Miller, Jurgen Van Gael, Yee Whye Teh, April 2009. University of Cambridge, Computational and Biological Learning Laboratory, Department of Engineering.

Abstract▼ URL

The Indian Buffet Process (IBP) is a nonparametric prior for latent feature models in which observations are influenced by a combination of hidden features. For example, images may be composed of several objects and sounds may consist of several notes. Latent feature models seek to infer these unobserved features from a set of observations; the IBP provides a principled prior in situations where the number of hidden features is unknown. Current inference methods for the IBP have all relied on sampling. While these methods are guaranteed to be accurate in the limit, samplers for the IBP tend to mix slowly in practice. We develop a deterministic variational method for inference in the IBP based on truncating to finite models, provide theoretical bounds on the truncation error, and evaluate our method in several data regimes. This technical report is a longer version of Doshi-Velez et al. (2009).

#### Choosing a Variable to Clamp: Approximate Inference Using Conditioned Belief Propagation

Frederik Eaton, Zoubin Ghahramani, April 2009. (In 12th International Conference on Artificial Intelligence and Statistics). Edited by D. van Dyk, M. Welling. Clearwater Beach, FL, USA. Journal of Machine Learning Research.

Abstract▼ URL

In this paper we propose an algorithm for approximate inference on graphical models based on belief propagation (BP). Our algorithm is an approximate version of Cutset Conditioning, in which a subset of variables is instantiated to make the rest of the graph singly connected. We relax the constraint of single-connectedness, and select variables one at a time for conditioning, running belief propagation after each selection. We consider the problem of determining the best variable to clamp at each level of recursion, and propose a fast heuristic which applies back-propagation to the BP updates. We demonstrate that the heuristic performs better than selecting variables at random, and give experimental results which show that it performs competitively with existing approximate inference algorithms.

#### Statistical tools for ultra-deep pyrosequencing of fast evolving viruses

David A. Knowles, Susan Holmes, 2009. (In NIPS Workshop: Computational Biology).

Abstract▼ URL

We aim to detect minor variant Hepatitis B viruses (HBV) in 38 pyrosequencing samples from infected individuals. Errors involved in the amplification and ultra deep pyrosequencing (UDPS) of these samples are characterised using HBV plasmid controls. Homopolymeric regions and quality scores are found to be significant covariates in determining insertion and deletion (indel) error rates, but not mismatch rates which depend on the nucleotide transition matrix. This knowledge is used to derive two methods for classifying genuine mutations: a hypothesis testing framework and a mixture model. Using an approximate “ground truth” from a limiting dilution Sanger sequencing run, these methods are shown to outperform the naive percentage threshold approach. The possibility of early stage PCR errors becoming significant is investigated by simulation, which underlines the importance of the initial copy number.

**Comment:** web site

#### A kernel method for unsupervised structured network inference

C. Lippert, O. Stegle, Z. Ghahramani, K. Borgwardt, April 2009. (In 12th International Conference on Artificial Intelligence and Statistics). Edited by D. van Dyk, M. Welling. Clearwater Beach, FL, USA. Journal of Machine Learning Research. **Note**: ISSN: 1938-7228.

Abstract▼ URL

Network inference is the problem of inferring edges between a set of real-world objects, for instance, interactions between pairs of proteins in bioinformatics. Current kernel-based approaches to this problem share a set of common features: (i) they are supervised and hence require labeled training data; (ii) edges in the network are treated as mutually independent and hence topological properties are largely ignored; (iii) they lack a statistical interpretation. We argue that these common assumptions are often undesirable for network inference, and propose (i) an unsupervised kernel method (ii) that takes the global structure of the network into account and (iii) is statistically motivated. We show that our approach can explain commonly used heuristics in statistical terms. In experiments on social networks, different variants of our method demonstrate appealing predictive performance.

#### Occlusive Components Analysis

Jörg Lücke, Richard E. Turner, Maneesh Sahani, Marc Henniges, 2009. (In Advances in Neural Information Processing Systems 22). Edited by Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, A. Culotta. The MIT Press.

Abstract▼ URL

We study unsupervised learning in a probabilistic generative model for occlusion. The model uses two types of latent variables: one indicates which objects are present in the image, and the other how they are ordered in depth. This depth order then determines how the positions and appearances of the objects present, specified in the model parameters, combine to form the image. We show that the object parameters can be learnt from an unlabelled set of images in which objects occlude one another. Exact maximum-likelihood learning is intractable. However, we show that tractable approximations to Expectation Maximization (EM) can be found if the training images each contain only a small number of objects on average. In numerical experiments it is shown that these approximations recover the correct set of object parameters. Experiments on a novel version of the bars test using colored bars, and experiments on more realistic data, show that the algorithm performs well in extracting the generating causes. Experiments based on the standard bars benchmark test for object learning show that the algorithm performs well in comparison to other recent component extraction approaches. The model and the learning algorithm thus connect research on occlusion with the research field of multiple-causes component extraction methods.

#### Bayesian Exponential Family PCA

Shakir Mohamed, Katherine A. Heller, Zoubin Ghahramani, December 2009. (In Advances in Neural Information Processing Systems 21). Edited by D. Koller, D. Schuurmans, Y. Bengio, L. Bottou. Cambridge, MA, USA. The MIT Press.

Abstract▼ URL

Principal Components Analysis (PCA) has become established as one of the key tools for dimensionality reduction when dealing with real valued data. Approaches such as exponential family PCA and non-negative matrix factorisation have successfully extended PCA to non-Gaussian data types, but these techniques fail to take advantage of Bayesian inference and can suffer from problems of overfitting and poor generalisation. This paper presents a fully probabilistic approach to PCA, which is generalised to the exponential family, based on Hybrid Monte Carlo sampling. We describe the model which is based on a factorisation of the observed data matrix, and show performance of the model on both synthetic and real data.

**Comment:** spotlight.

#### Construction of Nonparametric Bayesian Models from Parametric Bayes Equations

Peter Orbanz, 2009. (In Advances in Neural Information Processing Systems 22). Edited by Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, A. Culotta. The MIT Press.

Abstract▼ URL

We consider the general problem of constructing nonparametric Bayesian models on infinite-dimensional random objects, such as functions, infinite graphs or infinite permutations. The problem has generated much interest in machine learning, where it is treated heuristically, but has not been studied in full generality in nonparametric Bayesian statistics, which tends to focus on models over probability distributions. Our approach applies a standard tool of stochastic process theory, the construction of stochastic processes from their finite-dimensional marginal distributions. The main contribution of the paper is a generalization of the classic Kolmogorov extension theorem to conditional probabilities. This extension allows a rigorous construction of nonparametric Bayesian models from systems of finite-dimensional, parametric Bayes equations. Using this approach, we show (i) how existence of a conjugate posterior for the nonparametric model can be guaranteed by choosing conjugate finite-dimensional models in the construction, (ii) how the mapping to the posterior parameters of the nonparametric model can be explicitly determined, and (iii) that the construction of conjugate models in essence requires the finite-dimensional models to be in the exponential family. As an application of our constructive framework, we derive a model on infinite permutations, the nonparametric Bayesian analogue of a model recently proposed for the analysis of rank data.

**Comment:** Supplements (proofs) and techreport version

#### Modeling and Visualizing Uncertainty in Gene Expression Clusters Using Dirichlet Process Mixtures

Carl Edward Rasmussen, Bernhard J. de la Cruz, Zoubin Ghahramani, David L. Wild, 2009. (IEEE/ACM Transactions on Computational Biology and Bioinformatics). **DOI**: 10.1109/TCBB.2007.70269. **ISSN**: 1545-5963.

Abstract▼ URL

Although the use of clustering methods has rapidly become one of the standard computational approaches in the literature of microarray gene expression data, little attention has been paid to uncertainty in the results obtained. Dirichlet process mixture (DPM) models provide a nonparametric Bayesian alternative to the bootstrap approach to modeling uncertainty in gene expression clustering. Most previously published applications of Bayesian model-based clustering methods have been to short time series data. In this paper, we present a case study of the application of nonparametric Bayesian clustering methods to the clustering of high-dimensional non-time-series gene expression data using full Gaussian covariances. We use the probability that two genes belong to the same cluster in a DPM model as a measure of the similarity of these gene expression profiles. Conversely, this probability can be used to define a dissimilarity measure, which, for the purposes of visualization, can be input to one of the standard linkage algorithms used for hierarchical clustering. Biologically plausible results are obtained from the Rosetta compendium of expression profiles which extend previously published cluster analyses of this data.

#### R/BHC: fast Bayesian hierarchical clustering for microarray data

R. Savage, K. A. Heller, Y. Xu, Zoubin Ghahramani, W. Truman, M. Grant, K. Denby, D. L. Wild, August 2009. (BMC Bioinformatics 2009). BioMed Central. **DOI**: 10.1186/1471-2105-10-242. **ISSN**: 1471-2105. **PubMed ID**: 19660130.

Abstract▼ URL

Background: Although the use of clustering methods has rapidly become one of the standard computational approaches in the literature of microarray gene expression data analysis, little attention has been paid to uncertainty in the results obtained. Results: We present an R/Bioconductor port of a fast novel algorithm for Bayesian agglomerative hierarchical clustering and demonstrate its use in clustering gene expression microarray data. The method performs bottom-up hierarchical clustering, using a Dirichlet Process (infinite mixture) to model uncertainty in the data and Bayesian model selection to decide at each step which clusters to merge. Conclusion: Biologically plausible results are presented from a well studied data set: expression profiles of *A. thaliana* subjected to a variety of biotic and abiotic stresses. Our method avoids several limitations of traditional methods, for example how many clusters there should be and how to choose a principled distance metric.

#### Function factorization using warped Gaussian processes

Mikkel N. Schmidt, June 2009. (In 26th International Conference on Machine Learning). Edited by Léon Bottou, Michael Littman. Montréal, QC, Canada. Omnipress.

Abstract▼ URL

We introduce a new approach to non-linear regression called function factorization, that is suitable for problems where an output variable can reasonably be modeled by a number of multiplicative interaction terms between non-linear functions of the inputs. The idea is to approximate a complicated function on a high-dimensional space by the sum of products of simpler functions on lower-dimensional subspaces. Function factorization can be seen as a generalization of matrix and tensor factorization methods, in which the data are approximated by the sum of outer products of vectors. We present a non-parametric Bayesian approach to function factorization where the priors over the factorizing functions are warped Gaussian processes, and we do inference using Hamiltonian Markov chain Monte Carlo. We demonstrate the superior predictive performance of the method on a food science data set compared to Gaussian process regression and tensor factorization using PARAFAC and GEMANOVA models.

#### Linearly constrained Bayesian matrix factorization for blind source separation

Mikkel N. Schmidt, December 2009. (In Advances in Neural Information Processing Systems 22). Edited by Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, A. Culotta. Cambridge, MA, USA. The MIT Press.

Abstract▼ URL

We present a general Bayesian approach to probabilistic matrix factorization subject to linear constraints. The approach is based on a Gaussian observation model and Gaussian priors with bilinear equality and inequality constraints. We present an efficient Markov chain Monte Carlo inference procedure based on Gibbs sampling. Special cases of the proposed model are Bayesian formulations of non-negative matrix factorization and factor analysis. The method is evaluated on a blind source separation problem. We demonstrate that our algorithm can be used to extract meaningful and interpretable features that are remarkably different from features extracted using existing related matrix factorization techniques.

**Comment:** code.

#### Probabilistic non-negative tensor factorization using Markov chain Monte Carlo

Mikkel N. Schmidt, Shakir Mohamed, August 2009. (In European Signal Processing Conference (EUSIPCO)). Glasgow, Scotland.

Abstract▼ URL

We present a probabilistic model for learning non-negative tensor factorizations (NTF), in which the tensor factors are latent variables associated with each data dimension. The non-negativity constraint for the latent factors is handled by choosing priors with support on the non-negative numbers. Two Bayesian inference procedures based on Markov chain Monte Carlo sampling are described: Gibbs sampling and Hamiltonian Markov chain Monte Carlo. We evaluate the model on two food science data sets, and show that the probabilistic NTF model leads to better predictions and avoids overfitting compared to existing NTF approaches.

**Comment:** Rated by reviewers amongst the top 5% of the presented papers.

#### Bayesian non-negative matrix factorization

Mikkel N. Schmidt, Ole Winther, Lars Kai Hansen, March 2009. (In 8th International Conference on Independent Component Analysis and Signal Separation). Paraty, Brazil. Springer. Lecture Notes in Computer Science (LNCS).

Abstract▼ URL

We present a Bayesian treatment of non-negative matrix factorization (NMF), based on a normal likelihood and exponential priors, and derive an efficient Gibbs sampler to approximate the posterior density of the NMF factors. On a chemical brain imaging data set, we show that this improves interpretability by providing uncertainty estimates. We discuss how the Gibbs sampler can be used for model order selection by estimating the marginal likelihood, and compare with the Bayesian information criterion. For computing the maximum a posteriori estimate we present an iterated conditional modes algorithm that rivals existing state-of-the-art NMF algorithms on an image feature extraction problem.

#### Factorial mixture of Gaussians and the marginal independence model

R. Silva, Z. Ghahramani, April 2009. (In 12th International Conference on Artificial Intelligence and Statistics). Clearwater Beach, FL, USA. Journal of Machine Learning Research. **Note**: ISSN: 1938-7228.

Abstract▼ URL

Marginal independence constraints play an important role in learning with graphical models. One way of parameterizing a model of marginal independencies is by building a latent variable model where two independent observed variables have no common latent source. In sparse domains, however, it might be advantageous to model the marginal observed distribution directly, without explicitly including latent variables in the model. There have been recent advances in Gaussian and binary models of marginal independence, but no models with non-linear dependencies between continuous variables have been proposed so far. In this paper, we describe how to generalize the Gaussian model of marginal independencies based on mixtures, and how to learn parameters. This requires a non-standard parameterization and raises difficult non-linear optimization issues.

**Comment:** Code at http://www.homepages.ucl.ac.uk/~ucgtrbd/code/fmog-version0.zip

#### Discovering temporal patterns of differential gene expression in microarray time series

O. Stegle, K. Denby, S. McHattie, A. Meade, D. Wild, Z. Ghahramani, K. Borgwardt, September 2009. (In German Conference on Bioinformatics). Halle, Germany.

Abstract▼ URL

A wealth of time series of microarray measurements have become available over recent years. Several two-sample tests for detecting differential gene expression in these time series have been defined, but they can only answer the question *whether* a gene is differentially expressed across the whole time series, not *in which intervals* it is differentially expressed. In this article, we propose a Gaussian process based approach for studying these dynamics of differential gene expression. In experiments on *Arabidopsis thaliana* gene expression levels, our novel technique helps us to uncover that the family of WRKY transcription factors appears to be involved in the early response to infection by a fungal pathogen.

#### A robust Bayesian two-sample test for detecting intervals of differential gene expression in microarray time series

O. Stegle, K. Denby, David L. Wild, Zoubin Ghahramani, Karsten Borgwardt, 2009. (In 13th Annual International Conference on Research in Computational Molecular Biology (RECOMB 2009)). Tucson, AZ, USA. Springer-Verlag. Lecture Notes in Bioinformatics. **DOI**: 10.1007/978-3-642-02008-7_14. **ISBN**: 978-3-642-02007-0.

Abstract▼ URL

Understanding the regulatory mechanisms that are responsible for an organism’s response to environmental changes is an important question in molecular biology. A first and important step towards this goal is to detect genes whose expression levels are affected by altered external conditions. A range of methods to test for differential gene expression, both in static as well as in time-course experiments, have been proposed. While these tests answer the question *whether* a gene is differentially expressed, they do not explicitly address the question *when* a gene is differentially expressed, although this information may provide insights into the course and causal structure of regulatory programs. In this article, we propose a two-sample test for identifying *intervals* of differential gene expression in microarray time series. Our approach is based on Gaussian process regression, can deal with arbitrary numbers of replicates and is robust with respect to outliers. We apply our algorithm to study the response of *Arabidopsis thaliana* genes to an infection by a fungal pathogen using a microarray time series dataset covering 30,336 gene probes at 24 time points. In classification experiments our test compares favorably with existing methods and provides additional insights into time-dependent differential expression.

#### System Identification in Gaussian Process Dynamical Systems

Ryan Turner, Marc Peter Deisenroth, Carl Edward Rasmussen, December 2009. (In NIPS Workshop on Nonparametric Bayes). Edited by Dilan Görür. Whistler, BC, Canada.

**Comment:** poster.

#### Adaptive Sequential Bayesian Change Point Detection

Ryan Turner, Yunus Saatçi, Carl Edward Rasmussen, December 2009. (In NIPS Workshop on Temporal Segmentation). Edited by Zaïd Harchaoui. Whistler, BC, Canada.

Abstract▼ URL

Real-world time series are often nonstationary with respect to the parameters of some underlying prediction model (UPM). Furthermore, it is often desirable to adapt the UPM to incoming regime changes as soon as possible, necessitating sequential inference about change point locations. A Bayesian algorithm for online change point detection (BOCPD) has been introduced recently by Adams and MacKay (2007). In this algorithm, uncertainty about the last change point location is updated sequentially, and is integrated out to make online predictions robust to parameter changes. BOCPD requires a set of fixed hyper-parameters which allow the user to fully specify the hazard function for change points and the prior distribution over the parameters of the UPM. In practice, finding the “right” hyper-parameters can be quite difficult. We therefore extend BOCPD by introducing hyper-parameter learning, without sacrificing the online nature of the algorithm. Hyper-parameter learning is performed by optimizing the marginal likelihood of the BOCPD model, a closed-form quantity which can be computed sequentially. We illustrate performance on three real-world datasets.

#### The infinite HMM for unsupervised PoS tagging

J. Van Gael, A. Vlachos, Z. Ghahramani, August 2009. (In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP)). Singapore. Association for Computational Linguistics. **ISBN**: 978-1-932432-62-6.

Abstract▼ URL

We extend previous work on fully unsupervised part-of-speech tagging. Using a non-parametric version of the HMM, called the infinite HMM (iHMM), we address the problem of choosing the number of hidden states in unsupervised Markov models for PoS tagging. We experiment with two non-parametric priors, the Dirichlet and Pitman-Yor processes, on the Wall Street Journal dataset using a parallelized implementation of an iHMM inference algorithm. We evaluate the results with a variety of clustering evaluation metrics and achieve equivalent or better performances than previously reported. Building on this promising result we evaluate the output of the unsupervised PoS tagger as a direct replacement for the output of a fully supervised PoS tagger for the task of shallow parsing and compare the two evaluations.

#### Unsupervised and constrained Dirichlet process mixture models for verb clustering

Andreas Vlachos, Anna Korhonen, Zoubin Ghahramani, 2009. (In Proceedings of the workshop on geometrical models of natural language semantics).

Abstract▼

In this work, we apply Dirichlet Process Mixture Models (DPMMs) to a learning task in natural language processing (NLP): lexical-semantic verb clustering. We thoroughly evaluate a method of guiding DPMMs towards a particular clustering solution using pairwise constraints. The quantitative and qualitative evaluation performed highlights the benefits of both standard and constrained DPMMs compared to previously used approaches. In addition, it sheds light on the use of evaluation measures and their practical application.

#### Unsupervised and constrained Dirichlet process mixture models for verb clustering

A. Vlachos, A. Korhonen, Z. Ghahramani, March 2009. (In 4th Workshop on Statistical Machine Translation, EACL '09). Athens, Greece.

Abstract▼ URL

In this work, we apply Dirichlet Process Mixture Models (DPMMs) to a learning task in natural language processing (NLP): lexical-semantic verb clustering. We thoroughly evaluate a method of guiding DPMMs towards a particular clustering solution using pairwise constraints. The quantitative and qualitative evaluation performed highlights the benefits of both standard and constrained DPMMs compared to previously used approaches. In addition, it sheds light on the use of evaluation measures and their practical application.

#### Tree-based inference for Dirichlet process mixtures

Yang Xu, Katherine A. Heller, Zoubin Ghahramani, April 2009. (In 12th International Conference on Artificial Intelligence and Statistics). Edited by D. van Dyk, M. Welling. Clearwater Beach, FL, USA. Microtome Publishing (paper), Journal of Machine Learning Research (online). **Note**: ISSN 1938-7228.

Abstract▼ URL

The Dirichlet process mixture (DPM) is a widely used model for clustering and for general nonparametric Bayesian density estimation. Unfortunately, like in many statistical models, exact inference in a DPM is intractable, and approximate methods are needed to perform efficient inference. While most attention in the literature has been placed on Markov chain Monte Carlo (MCMC) [1, 2, 3], variational Bayesian (VB) [4] and collapsed variational methods [5], [6] recently introduced a novel class of approximation for DPMs based on Bayesian hierarchical clustering (BHC). These tree-based combinatorial approximations efficiently sum over exponentially many ways of partitioning the data and offer a novel lower bound on the marginal likelihood of the DPM [6]. In this paper we make the following contributions: (1) We show empirically that the BHC lower bounds are substantially tighter than the bounds given by VB [4] and by collapsed variational methods [5] on synthetic and real datasets. (2) We also show that BHC offers a more accurate predictive performance on these datasets. (3) We further improve the tree-based lower bounds with an algorithm that efficiently sums contributions from alternative trees. (4) We present a fast approximate method for BHC. Our results suggest that our combinatorial approximate inference methods and lower bounds may be useful not only in DPMs but in other models as well.

#### Gaussian-process factor analysis for low-dimensional single-trial analysis of neural population activity

B. M. Yu, J. P. Cunningham, G. Santhanam, S. I. Ryu, K. V. Shenoy, M. Sahani, 2009. (Journal of Neurophysiology).

Abstract▼ URL

We consider the problem of extracting smooth, low-dimensional neural trajectories that summarize the activity recorded simultaneously from many neurons on individual experimental trials. Beyond the benefit of visualizing the high-dimensional, noisy spiking activity in a compact form, such trajectories can offer insight into the dynamics of the neural circuitry underlying the recorded activity. Current methods for extracting neural trajectories involve a two-stage process: the spike trains are first smoothed over time, then a static dimensionality-reduction technique is applied. We first describe extensions of the two-stage methods that allow the degree of smoothing to be chosen in a principled way and that account for spiking variability, which may vary both across neurons and across time. We then present a novel method for extracting neural trajectories – Gaussian-process factor analysis (GPFA) – which unifies the smoothing and dimensionality-reduction operations in a common probabilistic framework. We applied these methods to the activity of 61 neurons recorded simultaneously in macaque premotor and motor cortices during reach planning and execution. By adopting a goodness-of-fit metric that measures how well the activity of each neuron can be predicted by all other recorded neurons, we found that the proposed extensions improved the predictive ability of the two-stage methods. The predictive ability was further improved by going to GPFA. From the extracted trajectories, we directly observed a convergence in neural state during motor planning, an effect that was shown indirectly by previous studies. We then show how such methods can be a powerful tool for relating the spiking activity across a neural population to the subject’s behavior on a single-trial basis. Finally, to assess how well the proposed methods characterize neural population activity when the underlying time course is known, we performed simulations that revealed that GPFA performed tens of percent better than the best two-stage method.

#### Gaussian-process factor analysis for low-dimensional single-trial analysis of neural population activity

B. M. Yu, J. P. Cunningham, G. Santhanam, S. I. Ryu, K. V. Shenoy, M. Sahani, December 2009. (In Advances in Neural Information Processing Systems 21). Vancouver, BC.

Abstract▼ URL

We consider the problem of extracting smooth, low-dimensional neural trajectories that summarize the activity recorded simultaneously from many neurons on individual experimental trials. Beyond the benefit of visualizing the high-dimensional, noisy spiking activity in a compact form, such trajectories can offer insight into the dynamics of the neural circuitry underlying the recorded activity. Current methods for extracting neural trajectories involve a two-stage process: the spike trains are first smoothed over time, then a static dimensionality-reduction technique is applied. We first describe extensions of the two-stage methods that allow the degree of smoothing to be chosen in a principled way and that account for spiking variability, which may vary both across neurons and across time. We then present a novel method for extracting neural trajectories – Gaussian-process factor analysis (GPFA) – which unifies the smoothing and dimensionality-reduction operations in a common probabilistic framework. We applied these methods to the activity of 61 neurons recorded simultaneously in macaque premotor and motor cortices during reach planning and execution. By adopting a goodness-of-fit metric that measures how well the activity of each neuron can be predicted by all other recorded neurons, we found that the proposed extensions improved the predictive ability of the two-stage methods. The predictive ability was further improved by going to GPFA. From the extracted trajectories, we directly observed a convergence in neural state during motor planning, an effect that was shown indirectly by previous studies. We then show how such methods can be a powerful tool for relating the spiking activity across a neural population to the subject’s behavior on a single-trial basis. Finally, to assess how well the proposed methods characterize neural population activity when the underlying time course is known, we performed simulations that revealed that GPFA performed tens of percent better than the best two-stage method.

## 2008

#### On sparsity and overcompleteness in image models

Pietro Berkes, Richard E. Turner, Maneesh Sahani, 2008. (In Advances in Neural Information Processing Systems 20). Edited by J. C. Platt, D. Koller, Y. Singer, S. Roweis. The MIT Press.

Abstract▼ URL

Computational models of visual cortex, and in particular those based on sparse coding, have enjoyed much recent attention. Despite this currency, the question of how sparse or how over-complete a sparse representation should be has gone without a principled answer. Here, we use Bayesian model-selection methods to address these questions for a sparse-coding model based on a Student-t prior. Having validated our methods on toy data, we find that natural images are indeed best modelled by extremely sparse distributions; although for the Student-t prior, the associated optimal basis size is only modestly over-complete.

#### Derivation of Expectation Propagation for "Fast Gaussian process methods for point process intensity estimation"

J. P. Cunningham, 2008. Stanford University.

Abstract▼ URL

We derive the Expectation Propagation algorithm updates for approximating the posterior distribution on intensity in a conditionally inhomogeneous gamma interval process with a Gaussian Process prior (GP IGIP), a model which appeared in Cunningham, Shenoy, Sahani (2008) ICML.

#### Fast Gaussian process methods for point process intensity estimation

J. P. Cunningham, K. V. Shenoy, M. Sahani, June 2008. (In 25th International Conference on Machine Learning). Helsinki, Finland.

Abstract▼ URL

Point processes are difficult to analyze because they provide only a sparse and noisy observation of the intensity function driving the process. Gaussian Processes offer an attractive framework within which to infer underlying intensity functions. The result of this inference is a continuous function defined across time that is typically more amenable to analytical efforts. However, a naive implementation will become computationally infeasible in any problem of reasonable size, both in memory and run time requirements. We demonstrate problem specific methods for a class of renewal processes that eliminate the memory burden and reduce the solve time by orders of magnitude.

#### Inferring neural firing rates from spike trains using Gaussian processes

J. P. Cunningham, B. M. Yu, K. V. Shenoy, M. Sahani, December 2008. (In Advances in Neural Information Processing Systems 20). Vancouver, BC.

Abstract▼ URL

Neural spike trains present challenges to analytical efforts due to their noisy, spiking nature. Many studies of neuroscientific and neural prosthetic importance rely on a smoothed, denoised estimate of the spike train’s underlying firing rate. Current techniques to find time-varying firing rates require ad hoc choices of parameters, offer no confidence intervals on their estimates, and can obscure potentially important single trial variability. We present a new method, based on a Gaussian Process prior, for inferring probabilistically optimal estimates of firing rate functions underlying single or multiple neural spike trains. We test the performance of the method on simulated data and experimentally gathered neural spike trains, and we demonstrate improvements over conventional estimators.

**Comment:** Spotlight Presentation

#### Approximate Dynamic Programming with Gaussian Processes

Marc Peter Deisenroth, Jan Peters, Carl Edward Rasmussen, June 2008. (In 2008 American Control Conference (ACC 2008)). Seattle, WA, USA.

Abstract▼ URL

In general, it is difficult to determine an optimal closed-loop policy in nonlinear control problems with continuous-valued state and control domains. Hence, approximations are often inevitable. The standard method of discretizing states and controls suffers from the curse of dimensionality and strongly depends on the chosen temporal sampling rate. The paper introduces Gaussian Process Dynamic Programming (GPDP). In GPDP, value functions in the Bellman recursion of the dynamic programming algorithm are modeled using Gaussian processes. GPDP returns an optimal state-feedback for a finite set of states. Based on these outcomes, we learn a possibly discontinuous closed-loop policy on the entire state space by switching between two independently trained Gaussian processes.

**Comment:** code.

#### Model-Based Reinforcement Learning with Continuous States and Actions

Marc Peter Deisenroth, Carl Edward Rasmussen, Jan Peters, April 2008. (In Proceedings of the 16th European Symposium on Artificial Neural Networks (ESANN 2008)). Bruges, Belgium.

Abstract▼ URL

Finding an optimal policy in a reinforcement learning (RL) framework with continuous state and action spaces is challenging. Approximate solutions are often inevitable. GPDP is an approximate dynamic programming algorithm based on Gaussian process (GP) models for the value functions. In this paper, we extend GPDP to the case of unknown transition dynamics. After building a GP model for the transition dynamics, we apply GPDP to this model and determine a continuous-valued policy in the entire state space. We apply the resulting controller to the underpowered pendulum swing up. Moreover, we compare our results on this RL task to a nearly optimal discrete DP solution in a fully known environment.

#### Spoken Language Interaction with Model Uncertainty: An Adaptive Human-Robot Interaction System

Finale Doshi, Nicholas Roy, December 2008. (Connection Science).

Abstract▼ URL

Spoken language is one of the most intuitive forms of interaction between humans and agents. Unfortunately, agents that interact with people using natural language often experience communication errors and do not correctly understand the user’s intentions. Recent systems have successfully used probabilistic models of speech, language, and user behavior to generate robust dialog performance in the presence of noisy speech recognition and ambiguous language choices, but decisions made using these probabilistic models are still prone to errors due to the complexity of acquiring and maintaining a complete model of human language and behavior. In this paper, we describe a decision-theoretic model for human-robot interaction using natural language. Our algorithm is based on the Partially Observable Markov Decision Process (POMDP), which allows agents to choose actions that are robust not only to uncertainty from noisy or ambiguous speech recognition but also unknown user models. Like most dialog systems, a POMDP is defined by a large number of parameters that may be difficult to specify a priori from domain knowledge, and learning these parameters from the user may require an unacceptably long training period. We describe an extension to the POMDP model that allows the agent to acquire a linguistic model of the user online, including new vocabulary and word choice preferences. Our approach not only avoids a training period of constant questioning as the agent learns, but also allows the agent to actively query for additional information when its uncertainty suggests a high risk of mistakes. We demonstrate our approach both in simulation and on a natural language interaction system for a robotic wheelchair application.

#### Data, modelling and inference in road traffic networks

Richard J. Gibbens, Yunus Saatçi, June 2008. (Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences). **DOI**: 10.1098/rsta.2008.0020.

Abstract▼ URL

In this paper, we study UK road traffic data and explore a range of modelling and inference questions that arise from them. For example, loop detectors on the M25 motorway record speed and flow measurements at regularly spaced locations as well as the entry and exit lanes of junctions. An exploratory study of these data helps us to better understand and quantify the nature of congestion on the road network. From a traveller’s perspective it is crucially important to understand the overall journey times and we look at methods to improve our ability to predict journey times given access jointly to both real-time and historical loop detector data. Throughout this paper we will comment on related work derived from US freeway data.

#### Statistical models for partial membership

Katherine A. Heller, Sinead Williamson, Zoubin Ghahramani, July 2008. (In 25th International Conference on Machine Learning). Edited by Andrew McCallum, Sam Roweis. Helsinki, Finland. Omnipress.

Abstract▼ URL

We present a principled Bayesian framework for modeling partial memberships of data points to clusters. Unlike a standard mixture model which assumes that each data point belongs to one and only one mixture component, or cluster, a partial membership model allows data points to have fractional membership in multiple clusters. Algorithms which assign data points partial memberships to clusters can be useful for tasks such as clustering genes based on microarray data (Gasch & Eisen, 2002). Our Bayesian Partial Membership Model (BPM) uses exponential family distributions to model each cluster, and a product of these distributions, with weighted parameters, to model each datapoint. Here the weights correspond to the degree to which the datapoint belongs to each cluster. All parameters in the BPM are continuous, so we can use Hybrid Monte Carlo to perform inference and learning. We discuss relationships between the BPM and Latent Dirichlet Allocation, Mixed Membership models, Exponential Family PCA, and fuzzy clustering. Lastly, we show some experimental results and discuss nonparametric extensions to our model.

#### Metropolis algorithms for representative subgraph sampling

C. Hübler, K. Borgwardt, H.-P. Kriegel, Z. Ghahramani, December 2008. (In Proceedings of 8th IEEE International Conference on Data Mining (ICDM 2008)). Pisa, Italy. IEEE. **Note**: ISSN: 1550-4786.

Abstract▼ URL

While data mining in chemoinformatics studied graph data with dozens of nodes, systems biology and the Internet are now generating graph data with thousands and millions of nodes. Hence data mining faces the algorithmic challenge of coping with this significant increase in graph size: Classic algorithms for data analysis are often too expensive and too slow on large graphs. While one strategy to overcome this problem is to design novel efficient algorithms, the other is to ‘reduce’ the size of the large graph by sampling. This is the scope of this paper: We will present novel Metropolis algorithms for sampling a ‘representative’ small subgraph from the original large graph, with ‘representative’ describing the requirement that the sample shall preserve crucial graph properties of the original graph. In our experiments, we improve over the pioneering work of Leskovec and Faloutsos (KDD 2006), by producing representative subgraph samples that are both smaller and of higher quality than those produced by other methods from the literature.

#### Outlier robust Gaussian process classification

H. Kim, Zoubin Ghahramani, December 2008. (In Structural, Syntactic and Statistical Pattern Recognition). Edited by Niels da Vitoria Lobo. Berlin, Germany. Springer Berlin / Heidelberg. Lecture Notes in Computer Science (LNCS).

Abstract▼ URL

Gaussian process classifiers (GPCs) are a fully statistical model for kernel classification. We present a form of GPC which is robust to labeling errors in the data set. This model allows label noise not only near the class boundaries, but also far from the class boundaries, which can result from mistakes in labelling or gross errors in measuring the input features. We derive an outlier robust algorithm for training this model which alternates iterations based on the EP approximation and hyperparameter updates until convergence. We show the usefulness of the proposed algorithm with a model selection method through simulation results.

#### Outlier Robust Gaussian Process Classification

Hyun-Chul Kim, Zoubin Ghahramani, 2008. (In SSPR/SPR). Edited by Niels da Vitoria Lobo, Takis Kasparis, Fabio Roli, James Tin-Yau Kwok, Michael Georgiopoulos, Georgios C. Anagnostopoulos, Marco Loog. Springer. Lecture Notes in Computer Science. **ISBN**: 978-3-540-89688-3.

Abstract▼ URL

Gaussian process classifiers (GPCs) are a fully statistical model for kernel classification. We present a form of GPC which is robust to labeling errors in the data set. This model allows label noise not only near the class boundaries, but also far from the class boundaries, which can result from mistakes in labelling or gross errors in measuring the input features. We derive an outlier robust algorithm for training this model which alternates iterations based on the EP approximation and hyperparameter updates until convergence. We show the usefulness of the proposed algorithm with a model selection method through simulation results.

#### Approximations for Binary Gaussian Process Classification

Hannes Nickisch, Carl Edward Rasmussen, October 2008. (Journal of Machine Learning Research).

Abstract▼ URL

We provide a comprehensive overview of many recent algorithms for approximate inference in Gaussian process models for probabilistic binary classification. The relationships between several approaches are elucidated theoretically, and the properties of the different algorithms are corroborated by experimental results. We examine both 1) the quality of the predictive distributions and 2) the suitability of the different marginal likelihood approximations for model selection (selecting hyperparameters) and compare to a gold standard based on MCMC. Interestingly, some methods produce good predictive distributions although their marginal likelihood approximations are poor. Strong conclusions are drawn about the methods: The Expectation Propagation algorithm is almost always the method of choice unless the computational budget is very tight. We also extend existing methods in various ways, and provide unifying code implementing all approaches.

#### Probabilistic Inference for Fast Learning in Control

Carl Edward Rasmussen, Marc Peter Deisenroth, November 2008. (In Recent Advances in Reinforcement Learning). Edited by S. Girgin, M. Loth, R. Munos, P. Preux, D. Ryabko. Villeneuve d'Ascq, France. Springer-Verlag. Lecture Notes in Computer Science (LNCS).

Abstract▼ URL

We provide a novel framework for very fast model-based reinforcement learning in continuous state and action spaces. The framework requires probabilistic models that explicitly characterize their levels of confidence. Within this framework, we use flexible, non-parametric models to describe the world based on previously collected experience. We demonstrate learning on the cart-pole problem in a setting where we provide very limited prior knowledge about the task. Learning progresses rapidly, and a good policy is found after only a handful of iterations.

**Comment:** videos and more. slides.

#### Latent space variational Bayes

J.M. Sung, Z. Ghahramani, S.Y. Bang, November 2008. (IEEE Transactions on Pattern Analysis and Machine Intelligence). IEEE.

Abstract▼ URL

Variational Bayesian Expectation-Maximization (VBEM), an approximate inference method for probabilistic models based on factorizing over latent variables and model parameters, has been a standard technique for practical Bayesian inference. In this paper, we introduce a more general approximate inference framework for conjugate-exponential family models, which we call Latent-Space Variational Bayes (LSVB). In this approach, we integrate out the model parameters in an exact way, leaving only the latent variables. It can be shown that the LSVB approach gives better estimates of the model evidence as well as the distribution over the latent variables than the VBEM approach, but, in practice, the distribution over the latent variables has to be approximated. As a practical implementation, we present a First-order LSVB (FoLSVB) algorithm to approximate the distribution over the latent variables. From this approximate distribution, one can also estimate the model evidence and the posterior over the model parameters. The FoLSVB algorithm is directly comparable to the VBEM algorithm and has the same computational complexity. We discuss how LSVB generalizes the recently proposed collapsed variational methods to general conjugate-exponential families. Examples based on mixtures of Gaussians and mixtures of Bernoullis with synthetic and real-world data sets are used to illustrate some advantages of our method over VBEM.

#### Second-order latent space variational Bayes for approximate Bayesian inference

J.M. Sung, Z. Ghahramani, S.Y. Bang, December 2008. (IEEE Signal Processing Letters). IEEE.

In this letter, we consider a variational approximate Bayesian inference framework, latent-space variational Bayes (LSVB), in the general context of conjugate-exponential family models with latent variables. In the LSVB approach, we integrate out model parameters in an exact way and then perform the variational inference over only the latent variables. It can be shown that LSVB can achieve better estimates of the model evidence as well as the distribution over the latent variables than the popular variational Bayesian expectation-maximization (VBEM). However, the distribution over the latent variables in LSVB has to be approximated in practice. As an approximate implementation of LSVB, we propose a second-order LSVB (SoLSVB) method. In particular, VBEM can be derived as a special case of a first-order approximation in LSVB. SoLSVB can capture higher order statistics neglected in VBEM and can therefore achieve a better approximation. Examples of Gaussian mixture models are used to illustrate the comparison between our method and VBEM, demonstrating the improvement.

#### Modeling natural sounds with modulation cascade processes

Richard E. Turner, Maneesh Sahani, 2008. (In Advances in Neural Information Processing Systems 20). Edited by J. C. Platt, D. Koller, Y. Singer, S. Roweis. MIT Press.

Natural sounds are structured on many time-scales. A typical segment of speech, for example, contains features that span four orders of magnitude: Sentences ($\sim 1$ s); phonemes ($\sim 10^{-1}$ s); glottal pulses ($\sim 10^{-2}$ s); and formants ($\sim 10^{-3}$ s). The auditory system uses information from each of these time-scales to solve complicated tasks such as auditory scene analysis [1]. One route toward understanding how auditory processing accomplishes this analysis is to build neuroscience-inspired algorithms which solve similar tasks and to compare the properties of these algorithms with properties of auditory processing. There is however a discord: Current machine-audition algorithms largely concentrate on the shorter time-scale structures in sounds, and the longer structures are ignored. The reason for this is two-fold. Firstly, it is a difficult technical problem to construct an algorithm that utilises both sorts of information. Secondly, it is computationally demanding to simultaneously process data both at high resolution (to extract short temporal information) and for long duration (to extract long temporal information). The contribution of this work is to develop a new statistical model for natural sounds that captures structure across a wide range of time-scales, and to provide efficient learning and inference algorithms. We demonstrate the success of this approach on a missing data task.

#### Dirichlet process mixture models for verb clustering

A. Vlachos, Z. Ghahramani, A. Korhonen, July 2008. (In ICML Workshop on Prior Knowledge for Text and Language Processing). Edited by Guillaume Bouchard, Hal Daumé III, Marc Dymetman, Yee Whye Teh. Helsinki, Finland.

In this work we apply Dirichlet Process Mixture Models to a learning task in natural language processing (NLP): lexical-semantic verb clustering. We assess the performance on a dataset based on Levin’s (1993) verb classes using the recently introduced V-measure metric. In addition, we present a method to add human supervision to the model in order to influence the solution with respect to some prior knowledge. The quantitative evaluation performed highlights the benefits of the chosen method compared to previously used clustering approaches.

#### Probabilistic Models for Data Combination in Recommender Systems

Sinead Williamson, Zoubin Ghahramani, 2008. (In Learning from Multiple Sources Workshop, NIPS Conference). Whistler, Canada.

#### Flexible latent variable models for multi-task learning

J. Zhang, Z. Ghahramani, Y. Yang, December 2008. (Machine Learning). Springer Netherlands.

Given multiple prediction problems such as regression and classification, we are interested in a joint inference framework which can effectively borrow information among tasks to improve the prediction accuracy, especially when the number of training examples per problem is small. In this paper we propose a probabilistic framework which can support a set of latent variable models for different multi-task learning scenarios. We show that the framework is a generalization of standard learning methods for single prediction problems and that it can effectively model the shared structure among different prediction tasks. Furthermore, we present efficient algorithms for the empirical Bayes method as well as point estimation. Our experiments on both simulated datasets and real-world classification datasets show the effectiveness of the proposed models in two evaluation settings: the standard multi-task learning setting and the transfer learning setting.

## 2007

#### The Rendezvous algorithm: multiclass semi-supervised learning with Markov random walks

Arik Azran, June 2007. (In 24th International Conference on Machine Learning). Edited by Zoubin Ghahramani. Corvallis, OR, USA. Omnipress.

We consider the problem of multiclass classification where both labeled and unlabeled data points are given. We introduce and demonstrate a new approach for estimating a distribution over the missing labels where data points are viewed as nodes