In this work we develop a fast saliency detection method that can be applied to any differentiable image classifier. We train a masking model to manipulate the scores of the classifier by masking salient parts of the input image. Our model generalises well to unseen images and requires a single forward pass to perform saliency detection, making it suitable for use in real-time systems. We test our approach on CIFAR-10 and ImageNet datasets and show that the produced saliency maps are easily interpretable, sharp, and free of artifacts. We suggest a new metric for saliency and test our method on the ImageNet object localisation task. We achieve results outperforming other weakly supervised methods.

Piotr Dabkowski, Yarin Gal

[arXiv] [BibTex]

Dropout is used as a practical tool to obtain uncertainty estimates in large vision models and reinforcement learning (RL) tasks. But to obtain well-calibrated uncertainty estimates, a grid-search over the dropout probabilities is necessary - a prohibitive operation with large models, and an impossible one with RL. We propose a new dropout variant which gives improved performance and better calibrated uncertainties. Relying on recent developments in Bayesian deep learning, we use a continuous relaxation of dropout's discrete masks. Together with a principled optimisation objective, this allows for automatic tuning of the dropout probability in large models, and as a result faster experimentation cycles. In RL this allows the agent to adapt its uncertainty dynamically as more data is observed. We analyse the proposed variant extensively on a range of tasks, and give insights into common practice in the field where larger dropout probabilities are often used in deeper model layers.

Yarin Gal, Jiri Hron, Alex Kendall

[arXiv] [Software] [BibTex]
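The continuous relaxation of dropout's discrete masks mentioned above can be illustrated with a small sketch. This is not the authors' released software: it assumes a Concrete (relaxed Bernoulli) form in which a uniform sample `u` and the dropout probability `p` are pushed through a tempered sigmoid, and the function names and the `temperature` default are chosen here for illustration only.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def relaxed_drop_mask(p, n, temperature=0.1, rng=None, eps=1e-7):
    """Sample n relaxed 'drop' indicators in [0, 1] for dropout probability p.

    Assumed form: z = sigmoid((logit(p) + logit(u)) / temperature), u ~ Uniform(0, 1).
    As temperature -> 0 each z approaches a hard Bernoulli(p) drop indicator,
    but for temperature > 0 it stays differentiable with respect to p.
    """
    rng = rng or random.Random()
    logit_p = math.log(p + eps) - math.log(1.0 - p + eps)
    mask = []
    for _ in range(n):
        # Clamp u away from 0 and 1 so the logarithms stay finite.
        u = min(max(rng.random(), eps), 1.0 - eps)
        logit_u = math.log(u) - math.log(1.0 - u)
        mask.append(sigmoid((logit_p + logit_u) / temperature))
    return mask


def concrete_dropout(x, p, temperature=0.1, rng=None):
    """Apply the relaxed mask to a list of activations and rescale by 1 / (1 - p)."""
    drop = relaxed_drop_mask(p, len(x), temperature, rng)
    return [xi * (1.0 - d) / (1.0 - p) for xi, d in zip(x, drop)]
```

Because the mask is a smooth function of `p`, a gradient-based optimiser could in principle tune `p` alongside the weights, which is the property the abstract relies on; the optimisation objective itself is not shown here.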

Numerous deep learning applications benefit from multi-task learning with multiple regression and classification objectives. In this paper we make the observation that the performance of such systems is strongly dependent on the relative weighting between each task's loss. Tuning these weights by hand is a difficult and expensive process, making multi-task learning prohibitive in practice. We propose a principled approach to multi-task deep learning which weighs multiple loss functions by considering the homoscedastic uncertainty of each task. This allows us to simultaneously learn various quantities with different units or scales in both classification and regression settings. We demonstrate our model learning per-pixel depth regression, semantic and instance segmentation from a monocular input image. Perhaps surprisingly, we show our model can learn multi-task weightings and outperform separate models trained individually on each task.

Alex Kendall, Yarin Gal, Roberto Cipolla

[arXiv] [Software] [BibTex]
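One way to picture weighing losses by homoscedastic uncertainty is sketched below. It assumes the Gaussian-likelihood (regression) form, where each task loss is scaled by 1/(2σ²) and a log σ term stops the variance from growing without bound. The function name and the log-variance parametrisation are illustrative assumptions, and the exact scaling the paper uses for classification terms may differ.

```python
import math


def uncertainty_weighted_loss(task_losses, log_variances):
    """Combine per-task losses using one learnable log-variance per task.

    With s_i = log(sigma_i^2), the assumed objective is
        total = sum_i 0.5 * exp(-s_i) * L_i + 0.5 * s_i
    A large s_i down-weights task i, while the + 0.5 * s_i term penalises
    inflating the variance arbitrarily.
    """
    if len(task_losses) != len(log_variances):
        raise ValueError("need exactly one log-variance per task loss")
    total = 0.0
    for loss, s in zip(task_losses, log_variances):
        total += 0.5 * math.exp(-s) * loss + 0.5 * s
    return total
```

For a fixed task loss L, this expression is minimised at s = log L, so tasks with larger losses (for example because of different units or scales) receive smaller weights without any manual tuning.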

There are two major types of uncertainty one can model. Aleatoric uncertainty captures noise inherent in the observations. Epistemic uncertainty, on the other hand, accounts for uncertainty in the model, which can be explained away given enough data. Traditionally it has been difficult to model epistemic uncertainty in computer vision, but with new Bayesian deep learning tools this is now possible. We study the benefits of modeling epistemic vs. aleatoric uncertainty in Bayesian deep learning models for vision tasks. For this we present a Bayesian deep learning framework combining input-dependent aleatoric uncertainty together with epistemic uncertainty. We study models under the framework with per-pixel semantic segmentation and depth regression tasks. Further, our explicit uncertainty formulation leads to new loss functions for these tasks, which can be interpreted as learned attenuation. This makes the loss more robust to noisy data, also giving new state-of-the-art results on segmentation and depth regression benchmarks.

Alex Kendall, Yarin Gal

[arXiv] [BibTex]
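The split between the two kinds of uncertainty can be made concrete with a small sketch for a single regression output. It assumes the model is run for several stochastic forward passes, each returning a predicted mean and a predicted (input-dependent) noise variance. The function name and this particular estimator are illustrative assumptions rather than the paper's exact formulation.

```python
def decompose_uncertainty(means, variances):
    """Split predictive uncertainty for one output into two parts.

    means[t]     : predicted mean from stochastic forward pass t
    variances[t] : predicted aleatoric variance from pass t
    Returns (predictive mean, epistemic variance, aleatoric variance).
    """
    t = len(means)
    if t == 0 or t != len(variances):
        raise ValueError("need one variance per mean and at least one pass")
    mean = sum(means) / t
    # Disagreement between passes reflects model (epistemic) uncertainty.
    epistemic = sum((m - mean) ** 2 for m in means) / t
    # The average predicted noise level reflects observation (aleatoric) noise.
    aleatoric = sum(variances) / t
    return mean, epistemic, aleatoric
```

Under this sketch, the epistemic term shrinks as the passes come to agree with more data, while the aleatoric term stays at the noise level the model predicts for that input.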

To obtain uncertainty estimates with real-world Bayesian deep learning models, practical inference approximations are needed. Dropout variational inference (VI), for example, has been used for machine vision and medical applications, but VI can severely underestimate model uncertainty. Alpha-divergences are alternative divergences to VI’s KL objective, which are able to avoid VI’s uncertainty underestimation. But these are hard to use in practice: existing techniques can only use Gaussian approximating distributions and require existing models to be changed radically, and are thus of limited use for practitioners. We propose a re-parametrisation of the alpha-divergence objectives, deriving a simple inference technique which, together with dropout, can be easily implemented with existing models by simply changing the loss of the model. We demonstrate improved uncertainty estimates and accuracy compared to VI in dropout networks. We study our model’s epistemic uncertainty far away from the data using adversarial images, showing that these can be distinguished from non-adversarial images by examining our model’s uncertainty.

Yingzhen Li, Yarin Gal

[Paper] [arXiv] [BibTex]

Even though active learning forms an important pillar of machine learning, deep learning tools are not prevalent within it. Relying on Bayesian approaches to deep learning, in this paper we combine recent advances in Bayesian deep learning into the active learning framework in a practical way. We develop an active learning framework for high dimensional data, a task which has been extremely challenging so far with very sparse existing literature, and demonstrate it in melanoma (skin cancer) diagnosis.

Yarin Gal, Riashat Islam, Zoubin Ghahramani

[PDF] [Poster] [Code] [BibTex]

[Paper] [arXiv] [BibTex]

So I finally submitted my PhD thesis. In it I organised the already published results on how to obtain uncertainty in deep learning, and collected lots of bits and pieces of new research I had lying around (which I hadn't had the time to publish yet).

Yarin Gal

[PDF] [Blog post] [BibTex]

We present a new technique for recurrent neural network regularisation, relying on recent results at the intersection of Bayesian modelling and deep learning. Our RNN dropout variant is theoretically motivated and its effectiveness is demonstrated empirically, with the new approach improving on the single model state-of-the-art in language modelling with the Penn Treebank (73.4 test perplexity). This extends our arsenal of variational tools in deep learning.

Yarin Gal, Zoubin Ghahramani

[arXiv] [Software] [BibTex]

[Paper] [Poster]

[Paper] [BibTex]

We attempt to address PILCO's shortcomings by replacing its Gaussian process with a Bayesian deep dynamics model, while maintaining the framework’s probabilistic nature and its data-efficiency benefits. This task poses several interesting difficulties. First, we have to handle small data, and neural networks are notorious for their tendency to overfit. Furthermore, we must retain PILCO's ability to capture 1) dynamics model output uncertainty and 2) input uncertainty.

Yarin Gal, Rowan McAllister and Carl E. Rasmussen

[Paper] [Abstract] [Poster] [BibTex]

Bayesian modelling and variational inference are rooted in Bayesian statistics, and easily benefit from the vast literature in the field. In contrast, deep learning lacks a solid mathematical grounding. Instead, empirical developments in deep learning are often justified by metaphors, evading the unexplained principles at play. In this paper we extend previous results casting modern deep learning models as performing approximate variational inference in a Bayesian setting, and survey open research problems.

Yarin Gal, Zoubin Ghahramani

[PDF] [Poster] [BibTex]

Perhaps ironically, the deep learning community is far closer to our vision of "automated modelling" than the probabilistic modelling community. Many complex models in deep learning can be easily implemented and tested, while variational inference (VI) techniques require specialised knowledge and long development cycles, making them extremely challenging for non-experts. We discuss a possible solution lifted from manufacturing. Similar ideas in deep learning have led to rapid development in model complexity, speeding up the innovation cycle.

Yarin Gal

[PDF] [Poster] [BibTex]

We present an efficient Bayesian convolutional neural network (convnet). The model offers better robustness to over-fitting on small data and achieves a considerable improvement in classification accuracy compared to previous approaches. We give state-of-the-art results on CIFAR-10 following our insights.

Yarin Gal, Zoubin Ghahramani

[arXiv] [Software] [BibTex]

[CMT Reviews] [OpenReview] [BibTex]

We show that dropout in multilayer perceptron models (MLPs) can be interpreted as a Bayesian approximation. Results are obtained for modelling uncertainty for dropout MLP models - extracting information that has been thrown away so far, from existing models. This mitigates the problem of representing uncertainty in deep learning without sacrificing computational performance or test accuracy.

Yarin Gal, Zoubin Ghahramani

[arXiv] [BibTex] [Appendix] [BibTex] [Software]
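A minimal sketch of how uncertainty might be read off an existing dropout network follows: dropout stays switched on at prediction time, the network is evaluated several times, and the spread of the outputs is summarised. The tiny one-hidden-layer network without biases, the function names, and the choice of sample mean and variance as summaries are all assumptions made for illustration.

```python
import random


def forward_with_dropout(x, w_hidden, w_out, p_drop, rng):
    """One stochastic forward pass through a one-hidden-layer ReLU network."""
    hidden = []
    for weights in w_hidden:
        h = max(0.0, sum(w * xi for w, xi in zip(weights, x)))  # ReLU unit
        keep = rng.random() >= p_drop
        # Inverted dropout: rescale kept units so the expected activation is unchanged.
        hidden.append(h / (1.0 - p_drop) if keep else 0.0)
    return sum(w * h for w, h in zip(w_out, hidden))


def mc_dropout_predict(x, w_hidden, w_out, p_drop=0.5, passes=100, rng=None):
    """Return the mean and variance of the output over repeated dropout passes."""
    rng = rng or random.Random()
    samples = [forward_with_dropout(x, w_hidden, w_out, p_drop, rng)
               for _ in range(passes)]
    mean = sum(samples) / passes
    variance = sum((s - mean) ** 2 for s in samples) / passes
    return mean, variance
```

For example, `mc_dropout_predict([1.0], [[1.0], [2.0]], [1.0, 1.0])` returns a mean close to the deterministic output together with a non-zero variance, which serves as a rough uncertainty signal. The weights themselves are untouched, so the information comes from a model that was already trained with dropout.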

[Paper] [Presentation] [Poster] [BibTex]

Deep learning techniques lack the ability to reason about uncertainty over the features. We show that a multilayer perceptron (MLP) with arbitrary depth and non-linearities, with dropout applied after every weight layer, is mathematically equivalent to an approximation to a well known Bayesian model. This paper is a short version of the appendix of "Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning".

Yarin Gal, Zoubin Ghahramani

[PDF] [Poster] [BibTex]

We define a new process that gives a natural generalisation of the Indian buffet process (used for binary feature allocation) into categorical latent features. For this we take advantage of different limit parametrisations of the Dirichlet process and its generalisation, the Pitman–Yor process.

Yarin Gal, Tomoharu Iwata, Zoubin Ghahramani

[Presentation] [BibTex]

Standard sparse pseudo-input approximations to the Gaussian process (GP) cannot handle complex functions well. Sparse spectrum alternatives attempt to answer this but are known to over-fit. We use variational inference for the sparse spectrum approximation to avoid both issues. We extend the approximate inference to the distributed and stochastic domains.

Yarin Gal, Richard Turner

[PDF] [Presentation] [Poster] [Software] [BibTex]
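For readers unfamiliar with the sparse spectrum idea, the sketch below shows the basic approximation only, not the variational treatment proposed in the paper. It assumes a one-dimensional squared-exponential kernel, whose spectral density is Gaussian, and replaces the kernel with an average of cosines at a finite set of sampled frequencies. All function names are illustrative.

```python
import math
import random


def sample_frequencies(m, lengthscale, rng):
    """Draw m frequencies from the spectral density of a 1-D squared-exponential kernel."""
    return [rng.gauss(0.0, 1.0 / lengthscale) for _ in range(m)]


def sparse_spectrum_kernel(x, y, freqs):
    """Approximate k(x, y) by averaging cos(w * (x - y)) over the sampled frequencies."""
    return sum(math.cos(w * (x - y)) for w in freqs) / len(freqs)


def rbf_kernel(x, y, lengthscale):
    """Exact squared-exponential kernel, for comparison."""
    return math.exp(-0.5 * ((x - y) / lengthscale) ** 2)
```

With a few thousand frequencies the approximation is typically close to the exact kernel. Treating the frequencies as free parameters and optimising them directly is where over-fitting can creep in, which is the issue the abstract refers to.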

Multivariate categorical data occur in many applications of machine learning. One of the main difficulties with these vectors of categorical variables is sparsity. The number of possible observations grows exponentially with vector length, but dataset diversity might be poor in comparison. Recent models have gained significant improvement in supervised tasks with this data. These models embed observations in a continuous space to capture similarities between them. Building on these ideas we propose a Bayesian model for the unsupervised task of distribution estimation of multivariate categorical data.

Yarin Gal, Yutian Chen, Zoubin Ghahramani

[PDF] [Poster] [Presentation] [BibTex]

[PDF] [Presentation] [Poster] [Software] [BibTex]

We develop parallel inference for sparse Gaussian process regression and latent variable models. These processes are used to model functions in a principled way and for non-linear dimensionality reduction in linear time complexity. Using parallel inference we allow the models to work on much larger datasets than before.

Yarin Gal, Mark van der Wilk, Carl E. Rasmussen

[arXiv] [Presentation] [Software] [BibTex]

[PDF] [BibTex]

We define a new combinatorial structure that unifies Kingman's random partitions and Broderick, Pitman, and Jordan's feature frequency models. This structure underlies non-parametric multi-view clustering models, where data points are simultaneously clustered into different possible clusterings. The de Finetti measure is a product of paintbox constructions. Studying the properties of feature partitions allows us to understand the relations between the models they underlie and share algorithmic insights between them.

Yarin Gal, Zoubin Ghahramani

[Link] [Poster]

We introduce a new class of models over trees based on the theory of fragmentation processes. The Dirichlet Fragmentation Process Mixture Model is an example model derived from this new class. This model has efficient and simple inference, and significantly outperforms existing approaches for hierarchical clustering and density modelling.

Hong Ge, Yarin Gal, Zoubin Ghahramani

[PDF] [BibTex]

We show that the recently suggested parallel inference for the Dirichlet process is conceptually invalid. The Dirichlet process is important for many fields such as natural language processing. However, the suggested inference would not work in most real-world applications.

Yarin Gal, Zoubin Ghahramani

[PDF] [Presentation] [BibTex]

[PDF] [Talk] [Presentation] [Poster] [BibTex]

We present an in-depth and self-contained tutorial for sparse Gaussian Process (GP) regression. We also explain GP latent variable models, a tool for non-linear dimensionality reduction. The sparse approximation reduces the time complexity of the models from cubic to linear, but its development is scattered across the literature. The various results are collected here.

Yarin Gal, Mark van der Wilk

[arXiv] [BibTex]

Over the past 50 years many have debated what representation should be used to capture the meaning of natural language utterances. Recently, research has raised new needs for such representations. Here I survey some of the interesting representations suggested to answer these new needs.

Yarin Gal

[arXiv] [BibTex]

We used a non-parametric process, the hierarchical Pitman–Yor process, in models that align words between pairs of sentences. These alignment models are used at the core of all machine translation systems. We obtained a significant improvement in translation using the process.

Yarin Gal, Phil Blunsom

[PDF] [Presentation] [BibTex]

We used a non-parametric process, the hierarchical Pitman–Yor process, to relax some of the restricting assumptions often used in machine translation. When a long history of word alignments is not available, the process falls back onto shorter histories in a principled way.

Yarin Gal

[PDF] [BibTex]

We trained a feed-forward neural network to play checkers. The network acts as both the value function for a min-max algorithm and a heuristic for pruning tree branches in a reinforcement learning setting. We used no supervised signal for training - a set of networks was assessed by playing against each other and the winning networks' weights were adapted following the ES algorithm.

Yarin Gal, Mireille Avigal

[Paper] [BibTex]

Engineering Department

Cambridge, CB2 1PZ

United Kingdom