# Monte Carlo Methods

Markov chain Monte Carlo (MCMC) methods use sampling to approximate high dimensional integrals and intractable sums. MCMC methods are widely used in many areas of science, applied mathematics and engineering. They are an indispensable approximate inference tool for Bayesian statistics and machine learning.

#### Sampling and inference for discrete random probability measures in probabilistic programs

Ben Bloem-Reddy, Emile Mathieu, Adam Foster, Tom Rainforth, Hong Ge, Maria Lomeli, Zoubin Ghahramani, December 2017. (In NIPS workshop on Advances in Approximate Inference). California, United States.

Abstract▼ URL

We consider the problem of sampling a sequence from a discrete random prob- ability measure (RPM) with countable support, under (probabilistic) constraints of finite memory and computation. A canonical example is sampling from the Dirichlet Process, which can be accomplished using its well-known stick-breaking representation and lazy initialization of its atoms. We show that efficiently lazy initialization is possible if and only if a size-biased representation of the discrete RPM is known. For models constructed from such discrete RPMs, we consider the implications for generic particle-based inference methods in probabilistic program- ming systems. To demonstrate, we implement posterior inference for Normalized Inverse Gaussian Process mixture models in Turing.

#### Equilibrium simulations of proteins using molecular fragment replacement and NMR chemical shifts

Wouter Boomsma, Pengfei Tian, Jes Frellsen, Jesper Ferkinghoff-Borg, Thomas Hamelryck, Kresten Lindorff-Larsen, Michele Vendruscolo, 2014. (Proceedings of the National Academy of Sciences). **DOI**: 10.1073/pnas.1404948111.

Abstract▼

Methods of protein structure determination based on NMR chemical shifts are becoming increasingly common. The most widely used approaches adopt the molecular fragment replacement strategy, in which structural fragments are repeatedly reassembled into different complete conformations in molecular simulations. Although these approaches are effective in generating individual structures consistent with the chemical shift data, they do not enable the sampling of the conformational space of proteins with correct statistical weights. Here, we present a method of molecular fragment replacement that makes it possible to perform equilibrium simulations of proteins, and hence to determine their free energy landscapes. This strategy is based on the encoding of the chemical shift information in a probabilistic model in Markov chain Monte Carlo simulations. First, we demonstrate that with this approach it is possible to fold proteins to their native states starting from extended structures. Second, we show that the method satisfies the detailed balance condition and hence it can be used to carry out an equilibrium sampling from the Boltzmann distribution corresponding to the force field used in the simulations. Third, by comparing the results of simulations carried out with and without chemical shift restraints we describe quantitatively the effects that these restraints have on the free energy landscapes of proteins. Taken together, these results demonstrate that the molecular fragment replacement strategy can be used in combination with chemical shift information to characterize not only the native structures of proteins but also their conformational fluctuations.

#### Scaling the iHMM: Parallelization versus Hadoop

Sébastien Bratières, Jurgen Van Gael, Andreas Vlachos, Zoubin Ghahramani, 2010. (In Proceedings of the 2010 10th IEEE International Conference on Computer and Information Technology). Bradford, UK. IEEE Computer Society. **DOI**: 10.1109/CIT.2010.223. **ISBN**: 978-0-7695-4108-2.

Abstract▼ URL

This paper compares parallel and distributed implementations of an iterative, Gibbs sampling, machine learning algorithm. Distributed implementations run under Hadoop on facility computing clouds. The probabilistic model under study is the infinite HMM Beal, Ghahramani and Rasmussen, 2002, in which parameters are learnt using an instance blocked Gibbs sampling, with a step consisting of a dynamic program. We apply this model to learn part-of-speech tags from newswire text in an unsupervised fashion. However our focus here is on runtime performance, as opposed to NLP-relevant scores, embodied by iteration duration, ease of development, deployment and debugging.

#### Unifying Orthogonal Monte Carlo Methods

Krzysztof Choromanski, Mark Rowland, Wenyu Chen, Adrian Weller, June 2019. (In 36th International Conference on Machine Learning). Long Beach.

Abstract▼ URL

Many machine learning methods making use of Monte Carlo sampling in vector spaces have been shown to be improved by conditioning samples to be mutually orthogonal. Exact orthogonal coupling of samples is computationally intensive, hence approximate methods have been of great interest. In this paper, we present a unifying perspective of many approximate methods by considering Givens transformations, propose new approximate methods based on this framework, and demonstrate the first statistical guarantees for families of approximate methods in kernel approximation. We provide extensive empirical evaluations with guidance for practitioners.

#### The Geometry of Random Features

Krzysztof Choromanski, Mark Rowland, Tamas Sarlos, Vikas Sindhwani, Richard E. Turner, Adrian Weller, April 2018. (In 21st International Conference on Artificial Intelligence and Statistics). Playa Blanca, Lanzarote, Canary Islands.

Abstract▼ URL

We present an in-depth examination of the effectiveness of radial basis function kernel (beyond Gaussian) estimators based on orthogonal random feature maps. We show that orthogonal estimators outperform state-of-the-art mechanisms that use iid sampling under weak conditions for tails of the associated Fourier distributions. We prove that for the case of many dimensions, the superiority of the orthogonal transform can be accurately measured by a property we define called the charm of the kernel, and that orthogonal random features provide optimal (in terms of mean squared error) kernel estimators. We provide the first theoretical results which explain why orthogonal random features outperform unstructured on downstream tasks such as kernel ridge regression by showing that orthogonal random features provide kernel algorithms with better spectral properties than the previous state-of-the-art. Our results enable practitioners more generally to estimate the benefits from applying orthogonal transforms.

#### The unreasonable effectiveness of structured random orthogonal embeddings

Krzysztof Choromanski, Mark Rowland, Adrian Weller, December 2017. (In Advances in Neural Information Processing Systems 31). Long Beach, California.

Abstract▼ URL

We examine a class of embeddings based on structured random matrices with orthogonal rows which can be applied in many machine learning applications including dimensionality reduction and kernel approximation. For both the Johnson-Lindenstrauss transform and the angular kernel, we show that we can select matrices yielding guaranteed improved performance in accuracy and/or speed compared to earlier methods. We introduce matrices with complex entries which give significant further accuracy improvement. We provide geometric and Markov chain-based perspectives to help understand the benefits, and empirical results which suggest that the approach is helpful in a wider range of applications.

#### Accelerated Gibbs sampling for the Indian buffet process

Finale Doshi-Velez, Zoubin Ghahramani, June 2009. (In 26th International Conference on Machine Learning). Edited by Léon Bottou, Michael Littman. Montréal, QC, Canada. Omnipress.

Abstract▼ URL

We often seek to identify co-occurring hidden features in a set of observations. The Indian Buffet Process (IBP) provides a non-parametric prior on the features present in each observation, but current inference techniques for the IBP often scale poorly. The collapsed Gibbs sampler for the IBP has a running time cubic in the number of observations, and the uncollapsed Gibbs sampler, while linear, is often slow to mix. We present a new linear-time collapsed Gibbs sampler for conjugate likelihood models and demonstrate its efficacy on large real-world datasets.

#### Accelerated sampling for the Indian Buffet Process

Finale Doshi-Velez, Zoubin Ghahramani, 2009. (In ICML). Edited by Andrea Pohoreckyj Danyluk, Léon Bottou, Michael L. Littman. acm. ACM International Conference Proceeding Series. **ISBN**: 978-1-60558-516-1.

Abstract▼ URL

We often seek to identify co-occurring hidden features in a set of observations. The Indian Buffet Process (IBP) provides a nonparametric prior on the features present in each observation, but current inference techniques for the IBP often scale poorly. The collapsed Gibbs sampler for the IBP has a running time cubic in the number of observations, and the uncollapsed Gibbs sampler, while linear, is often slow to mix. We present a new linear-time collapsed Gibbs sampler for conjugate likelihood models and demonstrate its efficacy on large real-world datasets.

#### On a class of sigma-Stable Poisson-Kingman models and an effective marginalised sampler

Stefano Favaro, Maria Lomeli, Yee Whye Teh, 2015. (Statistics and Computing).

Abstract▼ URL

We investigate the use of a large class of discrete random probability measures, which is referred to as the class Q, , in the context of Bayesian nonparametric mixture modeling. The class Q encompasses both the the two-parameter Poisson?Dirichlet process and the normalized generalized Gamma process, thus allowing us to comparatively study the inferential advantages of these two well-known nonparametric priors. Apart from ahighly flexible parameterization, the distinguishing feature of the class Q is the availability of a tractable posterior distribution. This feature, in turn, leads to derive an efficient marginal MCMC algorithm for posterior sampling within the framework of mixture models. We demonstrate the efficacy of our modeling framework on both one-dimensional and multi-dimensional datasets.

#### Fast relative Entropy coding with A* coding

Gergely Flamich, Stratis Markou, José Miguel Hernández-Lobato, 2022. (In 39th International Conference on Machine Learning).

Abstract▼ URL

Relative entropy coding (REC) algorithms encode a sample from a target distribution Q using a proposal distribution P, such that the expected codelength is 𝒪(D_KL[Q||P]). REC can be seamlessly integrated with existing learned compression models since, unlike entropy coding, it does not assume discrete Q or P, and does not require quantisation. However, general REC algorithms require an intractable Ω(e^D_KL[Q||P]) runtime. We introduce AS* and AD* coding, two REC algorithms based on A* sampling. We prove that, for continuous distributions over ℝ, if the density ratio is unimodal, AS* has 𝒪(D_∞[Q||P]QP) expected runtime, where D_∞[Q||P]QP is the Rényi ∞-divergence. We provide experimental evidence that AD* also has 𝒪(D_∞[Q||P]QP) expected runtime. We prove that AS* and AD* achieve an expected codelength of 𝒪(D_KL[Q||P]). Further, we introduce DAD*, an approximate algorithm based on AD* which retains its favourable runtime and has bias similar to that of alternative methods. Focusing on VAEs, we propose the IsoKL VAE (IKVAE), which can be used with DAD* to further improve compression efficiency. We evaluate A* coding with (IK)VAEs on MNIST, showing that it can losslessly compress images near the theoretically optimal limit.

#### Combining the multicanonical ensemble with generative probabilistic models of local biomolecular structure

Jes Frellsen, Thomas Hamelryck, Jesper Ferkinghoff-Borg, 2014. (In Proceedings of the 59th World Statistics Congress of the International Statistical Institute). Hong Kong.

Abstract▼ URL

Markov chain Monte Carlo is a powerful tool for sampling complex systems such as large biomolecular structures. However, the standard Metropolis-Hastings algorithm suffers from a number of deficiencies when applied to systems with rugged free-energy landscapes. Some of these deficiencies can be addressed with the multicanonical ensemble. In this paper we will present two strategies for applying the multicanonical ensemble to distributions constructed from generative probabilistic models of local biomolecular structure. In particular, we will describe how to use the multicanonical ensemble efficiently in conjunction with the reference ratio method.

#### Bayesian generalised ensemble Markov chain Monte Carlo

Jes Frellsen, Ole Winther, Zoubin Ghahramani, Jesper Ferkinghoff-Borg, May 2016. (In 19th International Conference on Artificial Intelligence and Statistics). Cadiz, Spain.

Abstract▼

Bayesian generalised ensemble (BayesGE) is a new method that addresses two major drawbacks of standard Markov chain Monte Carlo algorithms for inference in high-dimensional probability models: inapplicability to estimate the partition function, and poor mixing properties. BayesGE uses a Bayesian approach to iteratively update the belief about the density of states (distribution of the log likelihood under the prior) for the model, with the dual purpose of enhancing the sampling efficiency and make the estimation of the partition function tractable. We benchmark BayesGE on Ising and Potts systems and show that it compares favourably to existing state-of-the-art methods.

#### Metropolis algorithms for representative subgraph sampling

C. Hübler, K. Borgwardt, H.-P. Kriegel, Z. Ghahramani, December 2008. (In Proceedings of 8th IEEE International Conference on Data Mining (ICDM 2008)). Pisa, Italy. IEEE. **Note**: ISSN: 1550-4786.

Abstract▼ URL

While data mining in chemoinformatics studied graph data with dozens of nodes, systems biology and the Internet are now generating graph data with thousands and millions of nodes. Hence data mining faces the algorithmic challenge of coping with this significant increase in graph size: Classic algorithms for data analysis are often too expensive and too slow on large graphs. While one strategy to overcome this problem is to design novel efficient algorithms, the other is to ‘reduce’ the size of the large graph by sampling. This is the scope of this paper: We will present novel Metropolis algorithms for sampling a ‘representative’ small subgraph from the original large graph, with ‘representative’ describing the requirement that the sample shall preserve crucial graph properties of the original graph. In our experiments, we improve over the pioneering work of Leskovec and Faloutsos (KDD 2006), by producing representative subgraph samples that are both smaller and of higher quality than those produced by other methods from the literature.

#### Optimally-Weighted Herding is Bayesian Quadrature

Ferenc Huszár, David Duvenaud, July 2012. (In 28th Conference on Uncertainty in Artificial Intelligence). Catalina Island, California.

Abstract▼ URL

Herding and kernel herding are deterministic methods of choosing samples which summarise a probability distribution. A related task is choosing samples for estimating integrals using Bayesian quadrature. We show that the criterion minimised when selecting samples in kernel herding is equivalent to the posterior variance in Bayesian quadrature. We then show that sequential Bayesian quadrature can be viewed as a weighted version of kernel herding which achieves performance superior to any other weighted herding method. We demonstrate empirically a rate of convergence faster than O(1/N). Our results also imply an upper bound on the empirical error of the Bayesian quadrature estimate.

#### Austerity in MCMC Land: Cutting the Metropolis-Hastings Budget

Anoop Korattikara, Yutian Chen, Max Welling, June 2014. (In 31st International Conference on Machine Learning). Beijing, China.

Abstract▼ URL

Can we make Bayesian posterior MCMC sampling more efficient when faced with very large datasets? We argue that computing the likelihood for N datapoints in the Metropolis-Hastings (MH) test to reach a single binary decision is computationally inefficient. We introduce an approximate MH rule based on a sequential hypothesis test that allows us to accept or reject samples with high confidence using only a fraction of the data required for the exact MH rule. While this method introduces an asymptotic bias, we show that this bias can be controlled and is more than offset by a decrease in variance due to our ability to draw more samples per unit of time.

**Comment:** supplementary

#### Sparse Gaussian Process Hyperparameters: Optimize or Integrate?

Vidhi Lalchand, Wessel P. Bruinsma, David R. Burt, Carl E. Rasmussen, 2022. (In nips36).

Abstract▼ URL

The kernel function and its hyperparameters are the central model selection choice in a Gaussian process [Rasmussen and Williams, 2006]. Typically, the hyperparameters of the kernel are chosen by maximising the marginal likelihood, an approach known as Type-II maximum likelihood (ML-II). However, ML-II does not account for hyperparameter uncertainty, and it is well-known that this can lead to severely biased estimates and an underestimation of predictive uncertainty. While there are several works which employ a fully Bayesian characterisation of GPs, relatively few propose such approaches for the sparse GPs paradigm. In this work we propose an algorithm for sparse Gaussian process regression which leverages MCMC to sample from the hyperparameter posterior within the variational inducing point framework of [Titsias, 2009]. This work is closely related to Hensman et al. [2015b], but side-steps the need to sample the inducing points, thereby significantly improving sampling efficiency in the Gaussian likelihood case. We compare this scheme against natural baselines in literature along with stochastic variational GPs (SVGPs) along with an extensive computational analysis.

#### Debiasing a First-order Heuristic for Approximate Bi-level Optimization

Valerii Likhosherstov, Xingyou Song, Krzysztof Choromanski, Jared Q Davis, Adrian Weller, 18–24 Jul 2021. (In Proceedings of the 38th International Conference on Machine Learning). Edited by Marina Meila, Tong Zhang. PMLR. Proceedings of Machine Learning Research.

Abstract▼ URL

Approximate bi-level optimization (ABLO) consists of (outer-level) optimization problems, involving numerical (inner-level) optimization loops. While ABLO has many applications across deep learning, it suffers from time and memory complexity proportional to the length r of its inner optimization loop. To address this complexity, an earlier first-order method (FOM) was proposed as a heuristic which omits second derivative terms, yielding significant speed gains and requiring only constant memory. Despite FOM’s popularity, there is a lack of theoretical understanding of its convergence properties. We contribute by theoretically characterizing FOM’s gradient bias under mild assumptions. We further demonstrate a rich family of examples where FOM-based SGD does not converge to a stationary point of the ABLO objective. We address this concern by proposing an unbiased FOM (UFOM) enjoying constant memory complexity as a function of r. We characterize the introduced time-variance tradeoff, demonstrate convergence bounds, and find an optimal UFOM for a given ABLO problem. Finally, we propose an efficient adaptive UFOM scheme.

#### General Bayesian inference schemes in infinite mixture models

Maria Lomeli, 2017. University College London,Gatsby Unit, London, UK.

Abstract▼ URL

Bayesian statistical models allow us to formalise our knowledge about the world and reason about our uncertainty, but there is a need for better procedures to accurately encode its complexity. One way to do so is through compositional models, which are formed by combining blocks consisting of simpler models. One can increase the complexity of the compositional model by either stacking more blocks or by using a not-so-simple model as a building block. This thesis is an example of the latter. One first aim is to expand the choice of Bayesian nonparametric (BNP) blocks for constructing tractable compositional models. So far, most of the models that have a Bayesian nonparametric component use a Dirichlet Process or a Pitman-Yor process because of the availability of tractable and compact representations. This thesis shows how to overcome certain intractabilities in order to obtain analogous compact representations for the class of Poisson-Kingman priors which includes the Dirichlet and Pitman-Yor processes. A major impediment to the widespread use of Bayesian nonparametric building blocks is that inference is often costly, intractable or difficult to carry out. This is an active research area since dealing with the model’s infinite dimensional component forbids the direct use of standard simulation-based methods. The main contribution of this thesis is a variety of inference schemes that tackle this problem: Markov chain Monte Carlo and Sequential Monte Carlo methods, which are exact inference schemes since they target the true posterior. The contributions of this thesis, in a larger context, provide general purpose exact inference schemes in the flavour or probabilistic programming: the user is able to choose from a variety of models, focusing only on the modelling part. Indeed, if the wide enough class of Poisson-Kingman priors is used as one of our blocks, this objective is achieved.

#### A hybrid sampler for Poisson-Kingman mixture models

Maria Lomeli, Stefano Favaro, Yee Whye Teh, December 2015. (In Advances in Neural Information Processing Systems 28). Montreal, Canada.

Abstract▼ URL

This paper concerns the introduction of a new Markov Chain Monte Carlo scheme for posterior sampling in Bayesian nonparametric mixture models with priors that belong to the general Poisson-Kingman class. We present a novel and compact way of representing the infinite dimensional component of the model such that while explicitly representing this infinite component it has less memory and storage requirements than previous MCMC schemes. We describe comparative simulation results demonstrating the efficacy of the proposed MCMC algorithm against existing marginal and conditional MCMC samplers.

#### A marginal sampler for sigma-Stable Poisson-Kingman mixture models

Maria Lomeli, Stefano Favaro, Yee Whye Teh, 2017. (Journal of Computational and Graphical Statistics).

Abstract▼ URL

We investigate the class of sigma-stable Poisson-Kingman random probability measures (RPMs) in the context of Bayesian nonparametric mixture modeling. This is a large class of discrete RPMs, which encompasses most of the popular discrete RPMs used in Bayesian nonparametrics, such as the Dirichlet process, Pitman-Yor process, the normalized inverse Gaussian process, and the normalized generalized Gamma process. We show how certain sampling properties and marginal characterizations of sigma-stable Poisson-Kingman RPMs can be usefully exploited for devising a Markov chain Monte Carlo (MCMC) algorithm for performing posterior inference with a Bayesian nonparametric mixture model. Specifically, we introduce a novel and efficient MCMC sampling scheme in an augmented space that has a small number of auxiliary variables per iteration. We apply our sampling scheme to a density estimation and clustering tasks with unidimensional and multidimensional datasets, and compare it against competing MCMC sampling schemes. Supplementary materials for this article are available online.

#### Antithetic and Monte Carlo kernel estimators for partial rankings

Maria Lomeli, Mark Rowland, Arthur Gretton, Zoubin Ghahramani, 2018. (arXiv preprint arXiv:1807.00400).

Abstract▼ URL

In the modern age, rankings data is ubiquitous and it is useful for a variety of applications such as recommender systems, multi-object tracking and preference learning. However, most rankings data encountered in the real world is incomplete, which prevents the direct application of existing modelling tools for complete rankings. Our contribution is a novel way to extend kernel methods for complete rankings to partial rankings, via consistent Monte Carlo estimators for Gram matrices: matrices of kernel values between pairs of observations. We also present a novel variance reduction scheme based on an antithetic variate construction between permutations to obtain an improved estimator for the Mallows kernel. The corresponding antithetic kernel estimator has lower variance and we demonstrate empirically that it has a better performance in a variety of Machine Learning tasks. Both kernel estimators are based on extending kernel mean embeddings to the embedding of a set of full rankings consistent with an observed partial ranking. They form a computationally tractable alternative to previous approaches for partial rankings data. An overview of the existing kernels and metrics for permutations is also provided.

#### Bayesian Learning in Undirected Graphical Models: Approximate MCMC Algorithms

Iain Murray, Zoubin Ghahramani, 2004. (In UAI). Edited by David Maxwell Chickering, Joseph Y. Halpern. AUAI Press. **ISBN**: 0-9749039-0-6.

Abstract▼ URL

Bayesian learning in undirected graphical models|computing posterior distributions over parameters and predictive quantities is exceptionally difficult. We conjecture that for general undirected models, there are no tractable MCMC (Markov Chain Monte Carlo) schemes giving the correct equilibrium distribution over parameters. While this intractability, due to the partition function, is familiar to those performing parameter optimisation, Bayesian learning of posterior distributions over undirected model parameters has been unexplored and poses novel challenges. we propose several approximate MCMC schemes and test on fully observed binary models (Boltzmann machines) for a small coronary heart disease data set and larger artificial systems. While approximations must perform well on the model, their interaction with the sampling scheme is also important. Samplers based on variational mean- field approximations generally performed poorly, more advanced methods using loopy propagation, brief sampling and stochastic dynamics lead to acceptable parameter posteriors. Finally, we demonstrate these techniques on a Markov random field with hidden variables.

#### MCMC for Doubly-intractable Distributions

Iain Murray, Zoubin Ghahramani, David J. C. MacKay, 2006. (In UAI). AUAI Press. **ISBN**: 0-9749039-2-2.

Abstract▼ URL

Markov Chain Monte Carlo (MCMC) algorithms are routinely used to draw samples from distributions with intractable normalization constants. However, standard MCMC algorithms do not apply to doubly-intractable distributions in which there are additional parameter-dependent normalization terms; for example, the posterior over parameters of an undirected graphical model. An ingenious auxiliary-variable scheme (Møller et al., 2004) offers a solution: exact sampling (Propp and Wilson, 1996) is used to sample from a Metropolis–Hastings proposal for which the acceptance probability is tractable. Unfortunately the acceptance probability of these expensive updates can be low. This paper provides a generalization of Møller et al. (2004) and a new MCMC algorithm, which obtains better acceptance probabilities for the same amount of exact sampling, and removes the need to estimate model parameters before sampling begins.

#### Nested sampling for Potts models

Iain Murray, David J. C. MacKay, Zoubin Ghahramani, John Skilling, 2005. (In NIPS).

Abstract▼ URL

Nested sampling is a new Monte Carlo method by Skilling intended for general Bayesian computation. Nested sampling provides a robust alternative to annealing-based methods for computing normalizing constants. It can also generate estimates of other quantities such as posterior expectations. The key technical requirement is an ability to draw samples uniformly from the prior subject to a constraint on the likelihood. We provide a demonstration with the Potts model, an undirected graphical model.

#### PIPPS: Flexible Model-Based Policy Search Robust to the Curse of Chaos

Paavo Parmas, Carl Edward Rasmussen, Jan Peters, Kenji Doya, 2018. (In 35th International Conference on Machine Learning).

Abstract▼ URL

Previously, the exploding gradient problem has been explained to be central in deep learning and model-based reinforcement learning, because it causes numerical issues and instability in optimization. Our experiments in model-based reinforcement learning imply that the problem is not just a numerical issue, but it may be caused by a fundamental chaos-like nature of long chains of nonlinear computations. Not only do the magnitudes of the gradients become large, the direction of the gradients becomes essentially random. We show that reparameterization gradients suffer from the problem, while likelihood ratio gradients are robust. Using our insights, we develop a model-based policy search framework, Probabilistic Inference for Particle-Based Policy Search (PIPPS), which is easily extensible, and allows for almost arbitrary models and policies, while simultaneously matching the performance of previous data-efficient learning algorithms. Finally, we invent the total propagation algorithm, which efficiently computes a union over all pathwise derivative depths during a single backwards pass, automatically giving greater weight to estimators with lower variance, sometimes improving over reparameterization gradients by 106 times.

#### Gaussian Processes to Speed up Hybrid Monte Carlo for Expensive Bayesian Integrals

Carl Edward Rasmussen, 2003. (In Bayesian Statistics 7). Edited by J. M. Bernardo, M. J. Bayarri, J. O. Berger, A. P. Dawid, D. Heckerman, A. F. M. Smith, M. West. Oxford University Press.

Abstract▼ URL

Hybrid Monte Carlo (HMC) is often the method of choice for computing Bayesian integrals that are not analytically tractable. However the success of this method may require a very large number of evaluations of the (un-normalized) posterior and its partial derivatives. In situations where the posterior is computationally costly to evaluate, this may lead to an unacceptable computational load for HMC. I propose to use a Gaussian Process model of the (log of the) posterior for most of the computations required by HMC. Within this scheme only occasional evaluation of the actual posterior is required to guarantee that the samples generated have exactly the desired distribution, even if the GP model is somewhat inaccurate. The method is demonstrated on a 10 dimensional problem, where 200 evaluations suffice for the generation of 100 roughly independent points from the posterior. Thus, the proposed scheme allows Bayesian treatment of models with posteriors that are computationally demanding, such as models involving computer simulation.

#### A practical Monte Carlo implementation of Bayesian learning

Carl Edward Rasmussen, 1996. (In Advances in Neural Information Processing Systems 8). Edited by D. S. Touretzky, M. C. Mozer, M. E. Hasselmo. Cambridge, MA., USA. The MIT Press.

Abstract▼ URL

A practical method for Bayesian training of feed-forward neural networks using sophisticated Monte Carlo methods is presented and evaluated. In reasonably small amounts of computer time this approach outperforms other state-of-the-art methods on 5 datalimited tasks from real world domains.

#### Bayesian Monte Carlo

Carl Edward Rasmussen, Zoubin Ghahramani, December 2003. (In Advances in Neural Information Processing Systems 15). Edited by S. Becker, S. Thrun, K. Obermayer. Cambridge, MA, USA. The MIT Press.

Abstract▼ URL

We investigate Bayesian alternatives to classical Monte Carlo methods for evaluating integrals. Bayesian Monte Carlo (BMC) allows the incorporation of prior knowledge, such as smoothness of the integrand, into the estimation. In a simple problem we show that this outperforms any classical importance sampling method. We also attempt more challenging multidimensional integrals involved in computing marginal likelihoods of statistical models (a.k.a. partition functions and model evidences). We find that Bayesian Monte Carlo outperformed Annealed Importance Sampling, although for very high dimensional problems or problems with massive multimodality BMC may be less adequate. One advantage of the Bayesian approach to Monte Carlo is that samples can be drawn from any distribution. This allows for the possibility of active design of sample points so as to maximise information gain.

#### Geometrically coupled Monte Carlo sampling

Mark Rowland, Krzysztof Choromanski, Francois Chalus, Aldo Pacchiano, Tamas Sarlos, Richard Turner, Adrian Weller, December 2018. (In Advances in Neural Information Processing Systems 32). Montreal Canada.

Abstract▼ URL

Monte Carlo sampling in high-dimensional, low-sample settings is important in many machine learning tasks. We improve current methods for sampling in Euclidean spaces by avoiding independence, and instead consider ways to couple samples. We show fundamental connections to optimal transport theory, leading to novel sampling algorithms, and providing new theoretical grounding for existing strategies. We compare our new strategies against prior methods for improving sample efficiency, including quasi-Monte Carlo, by studying discrepancy. We explore our findings empirically, and observe benefits of our sampling schemes for reinforcement learning and generative modelling.

#### Orthogonal Estimation of Wasserstein Distances

Mark Rowland, Jiri Hron, Yunhao Tang, Krzysztof Choromanski, Tamas Sarlos, Adrian Weller, April 2019. (In 22nd International Conference on Artificial Intelligence and Statistics). Okinawa, Japan.

Abstract▼ URL

Wasserstein distances are increasingly used in a wide variety of applications in machine learning. Sliced Wasserstein distances form an important subclass which may be estimated efficiently through one-dimensional sorting operations. In this paper, we propose a new variant of sliced Wasserstein distance, study the use of orthogonal coupling in Monte Carlo estimation of Wasserstein distances and draw connections with stratified sampling, and evaluate our approaches experimentally in a range of large-scale experiments in generative modelling and reinforcement learning.

#### Kernel adaptive Metropolis-Hastings

Dino Sejdinovic, Heiko Strathmann, Maria Lomeli, Christophe Andrieu, Arthur Gretton, June 2012. (In 31st International Conference on Machine Learning). Beijing, China.

Abstract▼ URL

A Kernel Adaptive Metropolis-Hastings algo- rithm is introduced, for the purpose of sampling from a target distribution with strongly nonlin- ear support. The algorithm embeds the trajec- tory of the Markov chain into a reproducing ker- nel Hilbert space (RKHS), such that the fea- ture space covariance of the samples informs the choice of proposal. The procedure is com- putationally efficient and straightforward to im- plement, since the RKHS moves can be inte- grated out analytically: our proposal distribu- tion in the original space is a normal distribution whose mean and covariance depend on where the current sample lies in the support of the tar- get distribution, and adapts to its local covari- ance structure. Furthermore, the procedure re- quires neither gradients nor any other higher or- der information about the target, making it par- ticularly attractive for contexts such as Pseudo- Marginal MCMC. Kernel Adaptive Metropolis- Hastings outperforms competing fixed and adap- tive samplers on multivariate, highly nonlinear target distributions, arising in both real-world and synthetic examples.

#### Marginalised Gaussian Processes with Nested Sampling

Fergus Simpson, Vidhi Lalchand, Carl Edward Rasmussen, 2021. (In Advances in Neural Information Processing Systems 34). Curran Associates, Inc..

Abstract▼ URL

Gaussian Process models are a rich distribution over functions with inductive biases controlled by a kernel function. Learning occurs through optimisation of the kernel hyperparameters using the marginal likelihood as the objective. This work proposes nested sampling as a means of marginalising kernel hyperparameters, because it is a technique that is well-suited to exploring complex, multi-modal distributions. We benchmark against Hamiltonian Monte Carlo on time-series and two-dimensional regression tasks, finding that a principled approach to quantifying hyperparameter uncertainty substantially improves the quality of prediction intervals.

#### Magnetic Hamiltonian Monte Carlo

Nilesh Tripuraneni, Mark Rowland, Zoubin Ghahramani, Richard E. Turner, 2017. (In 34th International Conference on Machine Learning).

Abstract▼ URL

Hamiltonian Monte Carlo (HMC) exploits Hamiltonian dynamics to construct efficient proposals for Markov chain Monte Carlo (MCMC). In this paper, we present a generalization of HMC which exploits non-canonical Hamiltonian dynamics. We refer to this algorithm as magnetic HMC, since in 3 dimensions a subset of the dynamics map onto the mechanics of a charged particle coupled to a magnetic field. We establish a theoretical basis for the use of non-canonical Hamiltonian dynamics in MCMC, and construct a symplectic, leapfrog-like integrator allowing for the implementation of magnetic HMC. Finally, we exhibit several examples where these non-canonical dynamics can lead to improved mixing of magnetic HMC relative to ordinary HMC.

#### Dynamic Covariance Models for Multivariate Financial Time Series

Yue Wu, José Miguel Hernández-Lobato, Zoubin Ghahramani, June 2013. (In 30th International Conference on Machine Learning). Atlanta, Georgia, USA.

Abstract▼ URL

The accurate prediction of time-changing covariances is an important problem in the modeling of multivariate financial data. However, some of the most popular models suffer from a) overfitting problems and multiple local optima, b) failure to capture shifts in market conditions and c) large computational costs. To address these problems we introduce a novel dynamic model for time-changing covariances. Over-fitting and local optima are avoided by following a Bayesian approach instead of computing point estimates. Changes in market conditions are captured by assuming a diffusion process in parameter values, and finally computationally efficient and scalable inference is performed using particle filters. Experiments with financial data show excellent performance of the proposed method with respect to current standard models.

#### Continuous Relaxations for Discrete Hamiltonian Monte Carlo

Yichuan Zhang, Charles A. Sutton, Amos J. Storkey, Zoubin Ghahramani, 2012. (In NIPS). Edited by Peter L. Bartlett, Fernando C. N. Pereira, Christopher J. C. Burges, Léon Bottou, Kilian Q. Weinberger.

Abstract▼ URL

Continuous relaxations play an important role in discrete optimization, but have not seen much use in approximate probabilistic inference. Here we show that a general form of the Gaussian Integral Trick makes it possible to transform a wide class of discrete variable undirected models into fully continuous systems. The continuous representation allows the use of gradient-based Hamiltonian Monte Carlo for inference, results in new ways of estimating normalization constants (partition functions), and in general opens up a number of new avenues for inference in difficult discrete systems. We demonstrate some of these continuous relaxation inference algorithms on a number of illustrative problems.

#### Distributed Inference for Dirichlet Process Mixture Models

Hong Ge, Yutian Chen, Moquan Wan, Zoubin Ghahramani, 07–09 Jul 2015. (In Proceedings of the 32nd International Conference on Machine Learning). Edited by Francis Bach, David Blei. Lille, France. PMLR. Proceedings of Machine Learning Research.

Abstract▼ URL

Bayesian nonparametric mixture models based on the Dirichlet process (DP) have been widely used for solving problems like clustering, density estimation and topic modelling. These models make weak assumptions about the underlying process that generated the observed data. Thus, when more data are collected, the complexity of these models can change accordingly. These theoretical properties often lead to superior predictive performance when compared to traditional finite mixture models. However, despite the increasing amount of data available, the application of Bayesian nonparametric mixture models is so far limited to relatively small data sets. In this paper, we propose an efficient distributed inference algorithm for the DP and the HDP mixture model. The proposed method is based on a variant of the slice sampler for DPs. Since this sampler does not involve a pre-determined truncation, the stationary distribution of the sampling algorithm is unbiased. We provide both local thread-level and distributed machine-level parallel implementations and study the performance of this sampler through an extensive set of experiments on image and text data. When compared to existing inference algorithms, the proposed method exhibits state-of-the-art accuracy and strong scalability with up to 512 cores.

#### Turing: A Language for Flexible Probabilistic Inference

Hong Ge, Kai Xu, Zoubin Ghahramani, 09–11 Apr 2018. (In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics). Edited by Amos Storkey, Fernando Perez-Cruz. PMLR. Proceedings of Machine Learning Research.

Abstract▼ URL

Probabilistic programming promises to simplify and democratize probabilistic machine learning, but successful probabilistic programming systems require flexible, generic and efficient inference engines. In this work, we present a system called Turing for building MCMC algorithms for probabilistic programming inference. Turing has a very simple syntax and makes full use of the numerical capabilities in the Julia programming language, including all implemented probability distributions, and automatic differentiation. Turing supports a wide range of popular Monte Carlo algorithms, including Hamiltonian Monte Carlo (HMC), HMC with No-U-Turns (NUTS), Gibbs sampling, sequential Monte Carlo (SMC), and several particle MCMC (PMCMC) samplers. Most importantly, Turing inference is composable: it combines MCMC operations on subsets of variables, for example using a combination of an HMC engine and a particle Gibbs (PG) engine. We explore several combinations of inference methods with the aim of finding approaches that are both efficient and universal, i.e. applicable to arbitrary probabilistic models. NUTS—a popular variant of HMC that adapts Hamiltonian simulation path length automatically, although quite powerful for exploring differentiable target distributions, is however not universal. We identify some failure modes for the NUTS engine, and demonstrate that composition of PG (for discrete variables) and NUTS (for continuous variables) can be useful when the NUTS engine is either not applicable, or simply does not work well. Our aim is to present Turing and its composable inference engines to the world and encourage other researchers to build on this system to help advance the field of probabilistic machine learning.

#### Interoperability of statistical models in pandemic preparedness: principles and reality

George Nicholson, Marta Blangiardo, Mark Briers, Peter J Diggle, Tor Erlend Fjelde, Hong Ge, Robert J B Goudie, Radka Jersakova, Ruairidh E King, Brieuc C L Lehmann, Ann-Marie Mallon, Tullia Padellini, Yee Whye Teh, Chris Holmes, Sylvia Richardson, May 2022. (Stat. Sci.).

Abstract▼ URL

We present interoperability as a guiding framework for statistical modelling to assist policy makers asking multiple questions using diverse datasets in the face of an evolving pandemic response. Interoperability provides an important set of principles for future pandemic preparedness, through the joint design and deployment of adaptable systems of statistical models for disease surveillance using probabilistic reasoning. We illustrate this through case studies for inferring and characterising spatial-temporal prevalence and reproduction numbers of SARS-CoV-2 infections in England.

#### Couplings for Multinomial Hamiltonian Monte Carlo

Kai Xu, Tor Erlend Fjelde, Charles Sutton, Hong Ge, 13–15 Apr 2021. (In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics). Edited by Arindam Banerjee, Kenji Fukumizu. PMLR. Proceedings of Machine Learning Research.

Abstract▼ URL

Hamiltonian Monte Carlo (HMC) is a popular sampling method in Bayesian inference. Recently, Heng & Jacob (2019) studied Metropolis HMC with couplings for unbiased Monte Carlo estimation, establishing a generic parallelizable scheme for HMC. However, in practice a different HMC method, multinomial HMC, is considered as the go-to method, e.g. as part of the no-U-turn sampler. In multinomial HMC, proposed states are not limited to end-points as in Metropolis HMC; instead points along the entire trajectory can be proposed. In this paper, we establish couplings for multinomial HMC, based on optimal transport for multinomial sampling in its transition. We prove an upper bound for the meeting time – the time it takes for the coupled chains to meet – based on the notion of local contractivity. We evaluate our methods using three targets: 1,000 dimensional Gaussians, logistic regression and log-Gaussian Cox point processes. Compared to Heng & Jacob (2019), coupled multinomial HMC generally attains a smaller meeting time, and is more robust to choices of step sizes and trajectory lengths, which allows re-use of existing adaptation methods for HMC. These improvements together paves the way for a wider and more practical use of coupled HMC methods.