# Uncategorized

#### Bayesian nonnegative matrix factorization with volume prior for unmixing of hyperspectral images

Morten Arngren, Mikkel N. Schmidt, Jan Larsen, September 2009. (In Machine Learning for Signal Processing, IEEE Workshop on (MLSP)). Grenoble, France. **DOI**: 10.1109/MLSP.2009.5306262. **ISBN**: 978-1-4244-4947-7.

Abstract▼ URL

In hyperspectral image analysis the objective is to unmix a set of acquired pixels into pure spectral signatures (endmembers) and corresponding fractional abundances. The Non-negative Matrix Factorization (NMF) methods have received a lot of attention for this unmixing process. Many of these NMF based unmixing algorithms are based on sparsity regularization encouraging pure spectral endmembers, but this is not optimal for certain applications, such as foods, where abundances are not sparse. The pixels will theoretically lie on a simplex and hence the endmembers can be estimated as the vertices of the smallest enclosing simplex. In this context we present a Bayesian framework employing a volume constraint for the NMF algorithm, where the posterior distribution is numerically sampled from using a Gibbs sampling procedure. We evaluate the method on synthetical and real hyperspectral data of wheat kernels.

**Comment:** This paper was “rated among the best papers submitted” to the 2009 Machine Learning for Signal Processing conference.

#### A new approach to data driven clustering

Arik Azran, Zoubin Ghahramani, June 2006. (In 23rd International Conference on Machine Learning). Edited by William Cohen, Andrew Moore. Pittsburgh, PA, USA. Omnipress.

Abstract▼

We consider the problem of clustering in its most basic form where only a local metric on the data space is given. No parametric statistical model is assumed, and the number of clusters is learned from the data. We introduce, analyze and demonstrate a novel approach to clustering where data points are viewed as nodes of a graph, and pairwise similarities are used to derive a transition probability matrix P for a Markov random walk between them. The algorithm automatically reveals structure at increasing scales by varying the number of steps taken by this random walk. Points are represented as rows of Pt, which are the t-step distributions of the walk starting at that point; these distributions are then clustered using a KL-minimizing iterative algorithm. Both the number of clusters, and the number of steps that best reveal it, are found by optimizing spectral properties of P.

#### Spectral Methods for Automatic Multiscale Data Clustering

Arik Azran, Zoubin Ghahramani, June 2006. (In IEEE Conference on Computer Vision and Pattern Recognition (CVPR)). New York, NY, USA. IEEE Computer Society. **DOI**: 10.1109/CVPR.2006.289. **ISBN**: 0-7695-2597-0.

Abstract▼

Spectral clustering is a simple yet powerful method for finding structure in data using spectral properties of an associated pairwise similarity matrix. This paper provides new insights into how the method works and uses these to derive new algorithms which given the data alone automatically learn different plausible data partitionings. The main theoretical contribution is a generalization of a key result in the field, the multicut lemma (Meila 2001). We use this generalization to derive two algorithms. The first uses the eigenvalues of a given affinity matrix to infer the number of clusters in data, and the second combines learning the affinity matrix with inferring the number of clusters. A hierarchical implementation of the algorithms is also derived. The algorithms are theoretically motivated and demonstrated on nontrivial data sets.

#### The dynamic beamformer

A. Bahramisharif, M. A. J. van Gerven, J-M. Schoffelen, Z. Ghahramani, T. Heskes, 2012. (In Machine Learning in Interpretation of Neuroimaging (MLINI) 2011 LNAI 7263). Edited by G. Langs et al.

Abstract▼ URL

Beamforming is one of the most commonly used methods for estimating the active neural sources from the MEG or EEG sensor readings. The basic assumption in beamforming is that the sources are uncorrelated, which allows for estimating each source independent of the others. In this paper, we incorporate the independence assumption of the standard beamformer in a linear dynamical system, thereby introducing the dynamic beamformer. Using empirical data, we show that the dynamic beamformer outperforms the standard beamformer in predicting the condition of interest which strongly suggests that it also outperforms the standard method in localizing the active neural generators.

#### Converting to Optimization in Machine Learning: Perturb-and-MAP, Differential Privacy, and Program Synthesis

Matej Balog, 2020. University of Cambridge, Department of Engineering, Cambridge, UK.

Abstract▼ URL

On a mathematical level, most computational problems encountered in machine learning are instances of one of four abstract, fundamental problems: sampling, integration, optimization, and search. Thanks to the rich history of the respective mathematical fields, disparate methods with different properties have been developed for these four problem classes. As a result it can be beneficial to convert a problem from one abstract class into a problem of a different class, because the latter might come with insights, techniques, and algorithms well suited to the particular problem at hand. In particular, this thesis contributes four new methods and generalizations of existing methods for converting specific non-optimization machine learning tasks into optimization problems with more appealing properties. The first example is partition function estimation (an integration problem), where an existing algorithm – the Gumbel trick – for converting to the MAP optimization problem is generalized into a more general family of algorithms, such that other instances of this family have better statistical properties. Second, this family of algorithms is further generalized to another integration problem, the problem of estimating Rényi entropies. The third example shows how an intractable sampling problem arising when wishing to publicly release a database containing sensitive data in a safe (“differentially private”) manner can be converted into an optimization problem using the theory of Reproducing Kernel Hilbert Spaces. Finally, the fourth case study casts the challenging discrete search problem of program synthesis from input-output examples as a supervised learning task that can be efficiently tackled using gradient-based optimization. In all four instances, the conversions result in novel algorithms with desirable properties. In the first instance, new generalizations of the Gumbel trick can be used to construct statistical estimators of the partition function that achieve the same estimation error while using up to 40% fewer samples. The second instance shows that unbiased estimators of the Rényi entropy can be constructed in the Perturb-and-MAP framework. The main contribution of the third instance is theoretical: the conversion shows that it is possible to construct an algorithm for releasing synthetic databases that approximate databases containing sensitive data in a mathematically precise sense, and to prove results about their approximation errors. Finally, the fourth conversion yields an algorithm for synthesising program source code from input-output examples that is able to solve test problems 1-3 orders of magnitude faster than a wide range of baselines.

#### Fabular: Regression Formulas As Probabilistic Programming

Johannes Borgström, Andrew D. Gordon, Long Ouyang, Claudio Russo, Adam Ścibior, Marcin Szymczak, 2016. (In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages). New York, NY, USA. St. Petersburg, FL, USA. acm. POPL 2016. **DOI**: 10.1145/2837614.2837653. **ISBN**: 978-1-4503-3549-2. **ACM ID**: 2837653.

Abstract▼ URL

Regression formulas are a domain-specific language adopted by several R packages for describing an important and useful class of statistical models: hierarchical linear regressions. Formulas are succinct, expressive, and clearly popular, so are they a useful addition to probabilistic programming languages? And what do they mean? We propose a core calculus of hierarchical linear regression, in which regression coefficients are themselves defined by nested regressions (unlike in R). We explain how our calculus captures the essence of the formula DSL found in R. We describe the design and implementation of Fabular, a version of the Tabular schema-driven probabilistic programming language, enriched with formulas based on our regression calculus. To the best of our knowledge, this is the first formal description of the core ideas of R’s formula notation, the first development of a calculus of regression formulas, and the first demonstration of the benefits of composing regression formulas and latent variables in a probabilistic programming language.

#### Motor coordination: When two have to act as one

Daniel A. Braun, Pedro A. Ortega, Daniel M. Wolpert, 2011. (Special issue of Experimental Brain Research on Joint Action).

Abstract▼ URL

Trying to pass someone walking toward you in a narrow corridor is a familiar example of a two-person motor game that requires coordination. In this study, we investigate coordination in sensorimotor tasks that correspond to classic coordination games with multiple Nash equilibria, such as “choosing sides”, “stag hunt”, “chicken”, and “battle of sexes”. In these tasks, subjects made reaching movements reflecting their continuously evolving “decisions” while they received a continuous payoff in the form of a resistive force counteracting their movements. Successful coordination required two subjects to “choose” the same Nash equilibrium in this force-payoff landscape within a single reach. We found that on the majority of trials coordination was achieved. Compared to the proportion of trials in which miscoordination occurred, successful coordination was characterized by several distinct features: an increased mutual information between the players’ movement endpoints, an increased joint entropy during the movements, and by differences in the timing of the players’ responses. Moreover, we found that the probability of successful coordination depends on the players’ initial distance from the Nash equilibria. Our results suggest that two-person coordination arises naturally in motor interactions and is facilitated by favorable initial positions, stereotypical motor pattern, and differences in response times.

#### Nash equilibria in multi-agent motor interactions

Daniel A. Braun, Pedro A. Ortega, Daniel M. Wolpert, 2009. (PLoS Computational Biology).

Abstract▼ URL

Social interactions in classic cognitive games likeBlack-box alpha (BB-α) is a new approximate inference method based on the minimization of α-divergences. BB-αscales to large datasets because it can be implemented using stochastic gradient descent. BB-αcan be applied to complex probabilistic models with little effort since it only requires as input the likelihood function and its gradients. These gradients can be easily obtained using automatic differentiation. By changing the divergence parameter α, the method is able to interpolate between variational Bayes (VB) (α→0) and an algorithm similar to expectation propagation (EP) (α= 1). Experiments on probit regression and neural network regression and classification problems show that BB-αwith non-standard settings of α, such as α= 0.5, usually produces better predictions than with α→0 (VB) or α= 1 (EP). the ultimatum game or the prisoner’s dilemma typically lead to Nash equilibria when multiple competitive decision makers with perfect knowledge select optimal strategies. However, in evolutionary game theory it has been shown that Nash equilibria can also arise as attractors in dynamical systems that can describe, for example, the population dynamics of microorganisms. Similar to such evolutionary dynamics, we find that Nash equilibria arise naturally in motor interactions in which players vie for control and try to minimize effort. When confronted with sensorimotor interaction tasks that correspond to the classical prisoner’s dilemma and the rope-pulling game, two-player motor interactions led predominantly to Nash solutions. In contrast, when a single player took both roles, playing the sensorimotor game bimanually, cooperative solutions were found. Our methodology opens up a new avenue for the study of human motor interactions within a game theoretic framework, suggesting that the coupling of motor systems can lead to game theoretic solutions.

#### A Distributed Mechanism for Multi-Agent Convex Optimisation and Coordination with No-Regret Learners

Jan-Peter Calliess, Nathan Korda, Geoffrey J. Gordon, December 2016. (In Workshop on Learning, Inference and Control of Multi-Agent Systems, NIPS). Barcelona, Spain.

Abstract▼ URL

We develop an indirect mechanism for coordinated, distributed multi-agent optimisation, and decision-making. Our approach extends previous work in no-regret learning based mechanism design and renders it applicable to partial information settings. We consider planning problems that can be stated as a collection of single-agent convex programmes coupled by common soft constraints. A key idea is to recast the joint optimisation problem as distributed learning in a repeated game between the original agents and a newly introduced group of adversarial agents who influence prices for decisions and facilitate coordination. Under the weak behavioural assumption that all agents employ selfish, sub-linear regret algorithms in the course of the repeated game, we guarantee that our mechanism can achieve design goals such as social optimality (efficiency) and Nash-equilibrium convergence to within an error which approaches zero as the agents gain experience. Our error bounds are deterministic or probabilistic, depending on the nature of the regret bounds available for the algorithms employed by the agents. We illustrate our method in an emissions market application.

#### Probabilistic models for incomplete multi-dimensional arrays

W. Chu, Z. Ghahramani, April 2009. (In 12th International Conference on Artificial Intelligence and Statistics). Edited by D. van Dyk, M. Welling. Clearwater Beach, FL, USA. Microtome Publishing (paper) Journal of Machine Learning Research. **Note**: ISSN 1938-7228.

Abstract▼ URL

In multiway data, each sample is measured by multiple sets of correlated attributes. We develop a probabilistic framework for modeling structural dependency from partially observed multi-dimensional array data, known as pTucker. Latent components associated with individual array dimensions are jointly retrieved while the core tensor is integrated out. The resulting algorithm is capable of handling large-scale data sets. We verify the usefulness of this approach by comparing against classical models on applications to modeling amino acid fluorescence, collaborative filtering and a number of benchmark multiway array data.

#### Eliciting and Learning with Soft Labels from Every Annotator

Katherine M. Collins, Umang Bhatt, Adrian Weller, 2022. (In Proceedings of the AAAI Conference on Human Computation and Crowdsourcing (HCOMP)). **DOI**: 10.17863/CAM.87954.

Abstract▼ URL

The labels used to train machine learning (ML) models are of paramount importance. Typically for ML classification tasks, datasets contain hard labels, yet learning using soft labels has been shown to yield benefits for model generalization, robustness, and calibration. Earlier work found success in forming soft labels from multiple annotators’ hard labels; however, this approach may not converge to the best labels and necessitates many annotators, which can be expensive and inefficient. We focus on efficiently eliciting soft labels from individual annotators. We collect and release a dataset of soft labels (which we call CIFAR-10S) over the CIFAR-10 test set via a crowdsourcing study (N=248). We demonstrate that learning with our labels achieves comparable model performance to prior approaches while requiring far fewer annotators – albeit with significant temporal costs per elicitation. Our elicitation methodology therefore shows nuanced promise in enabling practitioners to enjoy the benefits of improved model performance and reliability with fewer annotators, and serves as a guide for future dataset curators on the benefits of leveraging richer information, such as categorical uncertainty, from individual annotators.

**Comment:** [Project Page] [Data] [Code]

#### The Infinite Partially Observable Markov Decision Process

Finale Doshi-Velez, December 2009. (In Advances in Neural Information Processing Systems 23). Cambridge, MA, USA. The MIT Press.

Abstract▼ URL

The Partially Observable Markov Decision Process (POMDP) framework has proven useful in planning domains where agents must balance actions that provide knowledge and actions that provide reward. Unfortunately, most POMDPs are complex structures with a large number of parameters. In many real-world problems, both the structure and the parameters are difficult to specify from domain knowledge alone. Recent work in Bayesian reinforcement learning has made headway in learning POMDP models; however, this work has largely focused on learning the parameters of the POMDP model. We define an infinite POMDP (iPOMDP) model that does not require knowledge of the size of the state space; instead, it assumes that the number of visited states will grow as the agent explores its world and only models visited states explicitly. We demonstrate the iPOMDP on several standard problems.

#### Spoken Language Interaction with Model Uncertainty: An Adaptive Human-Robot Interaction System

Finale Doshi, Nicholas Roy, December 2008. (Connection Science).

Abstract▼ URL

Spoken language is one of the most intuitive forms of interaction between humans and agents. Unfortunately, agents that interact with people using natural language often experience communication errors and do not correctly understand the user’s intentions. Recent systems have successfully used probabilistic models of speech, language, and user behavior to generate robust dialog performance in the presence of noisy speech recognition and ambiguous language choices, but decisions made using these probabilistic models are still prone to errors due to the complexity of acquiring and maintaining a complete model of human language and behavior. In this paper, we describe a decision-theoretic model for human-robot interaction using natural language. Our algorithm is based on the Partially Observable Markov Decision Process (POMDP), which allows agents to choose actions that are robust not only to uncertainty from noisy or ambiguous speech recognition but also unknown user models. Like most dialog systems, a POMDP is defined by a large number of parameters that may be difficult to specify a priori from domain knowledge, and learning these parameters from the user may require an unacceptably long training period. We describe an extension to the POMDP model that allows the agent to acquire a linguistic model of the user online, including new vocabulary and word choice preferences. Our approach not only avoids a training period of constant questioning as the agent learns, but also allows the agent to actively query for additional information when its uncertainty suggests a high risk of mistakes. We demonstrate our approach both in simulation and on a natural language interaction system for a robotic wheelchair application.

#### Prediction on Spike Data Using Kernel Algorithms

Jan Eichhorn, Andreas S. Tolias, Alexander Zien, Malte Kuß, Carl Edward Rasmussen, Jason Weston, Nikos K. Logothetis, Bernhard Schölkopf, 2004. (In Advances in Neural Information Processing Systems 16). Edited by Sebastian Thrun, Lawrence K. Saul, Bernhard Schölkopf. Cambridge, MA, USA. Vancouver, BC, Canada. The MIT Press.

Abstract▼ URL

We report and compare the performance of different learning algorithms based on data from cortical recordings. The task is to predict the orientation of visual stimuli from the activity of a population of simultaneously recorded neurons. We compare several ways of improving the coding of the input (i.e., the spike data) as well as of the output (i.e., the orientation), and report the results obtained using different kernel algorithms.

#### Variational Inference for Bayesian Mixtures of Factor Analysers

Zoubin Ghahramani, Matthew J. Beal, 1999. (In NIPS). Edited by Michael J. Kearns, Sara A. Solla, David A. Cohn. The MIT Press. **ISBN**: 0-262-11245-0.

Abstract▼ URL

We present an algorithm that infers the model structure of a mixture of factor analysers using an efficient and deterministic variational approximation to full Bayesian integration over model parameters. This procedure can automatically determine the optimal number of components and the local dimensionality of each component (i.e. the number of factors in each factor analyser). Alternatively it can be used to infer posterior distributions over number of components and dimensionalities. Since all parameters are integrated out the method is not prone to overfitting. Using a stochastic procedure for adding components it is possible to perform the variational optimisation incrementally and to avoid local maxima. Results show that the method works very well in practice and correctly infers the number and dimensionality of nontrivial synthetic examples. By importance sampling from the variational approximation we show how to obtain unbiased estimates of the true evidence, the exact predictive density, and the KL divergence between the variational posterior and the true posterior, not only in this model but for variational approximations in general.

#### Data, modelling and inference in road traffic networks

Richard J. Gibbens, Yunus Saatçi, June 2008. (Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences). **DOI**: 10.1098/rsta.2008.0020.

Abstract▼ URL

In this paper, we study UK road traffic data and explore a range of modelling and inference questions that arise from them. For example, loop detectors on the M25 motorway record speed and flow measurements at regularly spaced locations as well as the entry and exit lanes of junctions. An exploratory study of these data helps us to better understand and quantify the nature of congestion on the road network. From a traveller’s perspective it is crucially important to understand the overall journey times and we look at methods to improve our ability to predict journey times given access jointly to both real-time and historical loop detector data. Throughout this paper we will comment on related work derived from US freeway data.

#### Modelling Spikes with Mixtures of Factor Analysers

Dilan Görür, Carl Edward Rasmussen, Andreas S. Tolias, Fabian Sinz, Nikos K. Logothetis, 09 2004. (In DAGM 2004). Edited by C. E. Rasmussen, H. H. Bülthoff, B. Schölkopf, M. A. Giese. (Pattern Recognition: Proceedings of the 26th DAGM Symposium). Berlin, Germany. Tübingen, Germany. Springer. Lecture Notes in Computer Science (LNCS).

Abstract▼ URL

Identifying the action potentials of individual neurons from extracellular recordings, known as spike sorting, is a challenging problem. We consider the spike sorting problem using a generative model,mixtures of factor analysers, which concurrently performs clustering and feature extraction. The most important advantage of this method is that it quantifies the certainty with which the spikes are classified. This can be used as a means for evaluating the quality of clustering and therefore spike isolation. Using this method, nearly simultaneously occurring spikes can also be modelled which is a hard task for many of the spike sorting methods. Furthermore, modelling the data with a generative model allows us to generate simulated data.

#### Ranking Biomedical Passages for Relevance and Diversity: University of Wisconsin, Madison at TREC Genomics 2006

A. B. Goldberg, D. Andrzejewski, J. Van Gael, B. Settles, X. Zhu, M. Craven, November 2006. (In Proceedings of the Fifteenth Text REtrieval Conference (TREC 2006)). Gaithersburg, MD, USA.

Abstract▼ URL

We report on the University of Wisconsin, Madison’s experience in the TREC Genomics 2006 track, which asks participants to retrieve passages from scientific articles that satisfy biologists’ information needs. An emphasis is placed on returning relevant passages that discuss different aspects of the topic. Using an off-the-shelf information retrieval (IR) engine, we focused on query generation and reranking query results to encourage relevance and diversity. For query generation, we automatically identify noun phrases from the topic descriptions, and use online resources to gather synonyms as expansion terms. Our first submission uses the baseline IR engine results. We rerank the passages using a naive clustering-based approach in our second run, and we test GRASSHOPPER, a novel graph-theoretic algorithm based on absorbing random walks, in our third run. While our aspect-level results appear to compare favorably with other participants on average, our query generation techniques failed to produce adequate query results for several topics, causing our passage and document-level evaluation scores to suffer. Furthermore, we surprisingly achieved higher aspect-level scores using the initial ranking than our methods aimed specifically at promoting diversity. While this sounds discouraging, we have several ideas as to why this happened and hope to produce new methods that correct these shortcomings.

#### Improving diversity in ranking using absorbing random walks

A.B. Goldberg, X. Zhu, J. Van Gael, D. Andrzejewski, April 2007. (In Proceedings of NAACL HLT). Rochester, NY, USA.

#### Predictive Entropy Search for Bayesian Optimization with Unknown Constraints

José Miguel Hernández-Lobato, Michael A. Gelbart, Matthew W. Hoffman, Ryan P. Adams, Zoubin Ghahramani, 2015. (In 32nd International Conference on Machine Learning).

Abstract▼ URL

Unknown constraints arise in many types of expensive black-box optimization problems. Several methods have been proposed recently for performing Bayesian optimization with constraints, based on the expected improvement (EI) heuristic. However, EI can lead to pathologies when used with constraints. For example, in the case of decoupled constraints—i.e., when one can independently evaluate the objective or the constraints—EI can encounter a pathology that prevents exploration. Additionally, computing EI requires a current best solution, which may not exist if none of the data collected so far satisfy the constraints. By contrast, information-based approaches do not suffer from these failure modes. In this paper, we present a new information-based method called Predictive Entropy Search with Constraints (PESC). We analyze the performance of PESC and show that it compares favorably to EI-based approaches on synthetic and benchmark problems, as well as several real-world examples. We demonstrate that PESC is an effective algorithm that provides a promising direction towards a unified solution for constrained Bayesian optimization.

#### Statistical Fitting of Undrained Strength Data

Neil Houlsby, Guy Houlsby, 2013. (Geotechnique). Telford. **DOI**: 10.1680/geot.13.P.007.

Abstract▼ URL

We describe an approach, based on Bayesian statistical methods, that allows the fitting of a design profile to a set of measurements of undrained strengths. In particular we allow for the automatic determination of not only the positions of boundaries between geological units, but also the selection of the number of units to model the data in an appropriate way.

#### A Kernel Approach to Tractable Bayesian Nonparametrics

Ferenc Huszár, Simon Lacoste-Julien, 2011. University of Cambridge,

Abstract▼ URL

Inference in popular nonparametric Bayesian models typically relies on sampling or other approximations. This paper presents a general methodology for constructing novel tractable nonparametric Bayesian methods by applying the kernel trick to inference in a parametric Bayesian model. For example, Gaussian process regression can be derived this way from Bayesian linear regression. Despite the success of the Gaussian process framework, the kernel trick is rarely explicitly considered in the Bayesian literature. In this paper, we aim to fill this gap and demonstrate the potential of applying the kernel trick to tractable Bayesian parametric models in a wider context than just regression. As an example, we present an intuitive Bayesian kernel machine for density estimation that is obtained by applying the kernel trick to a Gaussian generative model in feature space.

**Comment:** arXiv:1103.1761

#### Mind reading by machine learning: A doubly Bayesian method for inferring mental representations

Ferenc Huszár, Uta Noppeney, Máté Lengyel, August 2010. (In The Proceedings of the 32nd Annual Meeting of the Cognitive Science Society). Edited by S. Ohlsson, R. Catrambone. Austin, TX, USA. The Cognitive Science Society.

Abstract▼ URL

A central challenge in cognitive science is to measure and quantify the mental representations humans develop — in other words, to ‘read’ subject’s minds. In order to eliminate potential biases in reporting mental contents due to verbal elaboration, subjects’ responses in experiments are often limited to binary decisions or discrete choices that do not require conscious reflection upon their mental contents. However, it is unclear what such impoverished data can tell us about the potential richness and dynamics of subjects’ mental representations. To address this problem, we used ideal observer models that formalise choice behaviour as (quasi-)Bayes-optimal, given subjects’ representations in long-term memory, acquired through prior learning, and the stimuli currently available to them. Bayesian inversion of such ideal observer models allowed us to infer subjects’ mental representation from their choice behaviour in a variety of psychophysical tasks. The inferred mental representations also allowed us to predict future choices of subjects with reasonable accuracy, even in tasks that were different from those in which the representations were estimated. These results demonstrate a significant potential in standard binary decision tasks to recover detailed information about subjects’ mental representations

**Comment:** Supplementary material available here.

#### Learning with Multiple Labels

Rong Jin, Zoubin Ghahramani, 2002. (In NIPS). Edited by Suzanna Becker, Sebastian Thrun, Klaus Obermayer. MIT Press. **ISBN**: 0-262-02550-7.

Abstract▼ URL

In this paper, we study a special kind of learning problem in which each training instance is given a set of (or distribution over) candidate class labels and only one of the candidate labels is the correct one. Such a problem can occur, e.g., in an information retrieval setting where a set of words is associated with an image, or if classes labels are organized hierarchically. We propose a novel discriminative approach for handling the ambiguity of class labels in the training examples. The experiments with the proposed approach over five different UCI datasets show that our approach is able to find the correct label among the set of candidate labels and actually achieve performance close to the case when each training instance is given a single correct label. In contrast, naïve methods degrade rapidly as more ambiguity is introduced into the labels.

#### Bayesian Knowledge Corroboration with Logical Rules and User Feedback

G. Kasneci, J. Van Gael, T. Graepel, R. Herbrich, September 2010. (In European Conference on Machine Learning (ECML)). Barcelona, Spain.

Abstract▼ URL

Current knowledge bases suffer from either low coverage or low accuracy. The underlying hypothesis of this work is that user feedback can greatly improve the quality of automatically extracted knowledge bases. The feedback could help quantify the uncertainty associated with the stored statements and would enable mechanisms for searching, ranking and reasoning at entity-relationship level. Most importantly, a principled model for exploiting user feedback to learn the truth values of statements in the knowledge base would be a major step forward in addressing the issue of knowledge base curation. We present a family of probabilistic graphical models that builds on user feedback and logical inference rules derived from the popular Semantic-Web formalism of RDFS [1]. Through internal inference and belief propagation, these models can learn both, the truth values of the statements in the knowledge base and the reliabilities of the users who give feedback. We demonstrate the viability of our approach in extensive experiments on real-world datasets, with feedback collected from Amazon Mechanical Turk.

#### Bayesian Classifier Combination

Hyun-Chul Kim, Zoubin Ghahramani, 2012. (In 15th International Conference on Artificial Intelligence and Statistics).

Abstract▼ URL

Bayesian model averaging linearly mixes the probabilistic predictions of multiple models, each weighted by its posterior probability. This is the coherent Bayesian way of combining multiple models only under certain restrictive assumptions, which we outline. We explore a general framework for Bayesian model combination (which differs from model averaging) in the context of classification. This framework explicitly models the relationship between each model’s output and the unknown true label. The framework does not require that the models be probabilistic (they can even be human assessors), that they share prior information or receive the same training data, or that they be independent in their errors. Finally, the Bayesian combiner does not need to believe any of the models is in fact correct. We test several variants of this classifier combination procedure starting from a classic statistical model proposed by Dawid and Skene (1979) and using MCMC to add more complex but important features to the model. Comparisons on sev- eral data sets to simpler methods like majority voting show that the Bayesian methods not only perform well but result in interpretable diagnostics on the data points and the models.

#### Gender Classification with Bayesian Kernel Methods

Hyun-Chul Kim, Daijin Kim, Zoubin Ghahramani, Sung Yang Bang, 2006. (In IJCNN). Edited by William W. Cohen, Andrew Moore. Association for Computing Machinery. ACM International Conference Proceeding Series. **ISBN**: 1-59593-383-2.

Abstract▼ URL

We consider the gender classification task of discriminating between images of faces of men and women from face images. In appearance-based approaches, the initial images are preprocessed (e.g. normalized) and input into classifiers. Recently, SVMs which are popular kernel classifiers have been applied to gender classification and have shown excellent performance. We propose to use one of Bayesian kernel methods which is Gaussian process classifiers (GPCs) for gender classification. The main advantage of Bayesian kernel methods such as GPCs over SVMs is that they determine the hyperparameters of the kernel based on Bayesian model selection criterion. Our results show that GPCs outperformed SVMs with cross validation.

#### Goal Misgeneralization in Deep Reinforcement Learning

Lauro Langosco di Langosco, Jack Koch, Lee D Sharkey, Jacob Pfau, David Krueger, 2022. (In icml2022).

#### Kronecker Graphs: An Approach to Modeling Networks

J. Leskovec, D. Chakrabarti, J. Kleinberg, C. Faloutsos, Z. Ghahramani, 2010. (Journal of Machine Learning Research).

Abstract▼ URL

How can we generate realistic networks? In addition, how can we do so with a mathematically tractable model that allows for rigorous analysis of network properties? Real networks exhibit a long list of surprising properties: Heavy tails for the in- and out-degree distribution, heavy tails for the eigenvalues and eigenvectors, small diameters, and densification and shrinking diameters over time. Current network models and generators either fail to match several of the above properties, are complicated to analyze mathematically, or both. Here we propose a generative model for networks that is both mathematically tractable and can generate networks that have all the above mentioned structural properties. Our main idea here is to use a non-standard matrix operation, the Kronecker product, to generate graphs which we refer to as “Kronecker graphs”.First, we show that Kronecker graphs naturally obey common network properties. In fact, we rigorously prove that they do so. We also provide empirical evidence showing that Kronecker graphs can effectively model the structure of real networks.We then present KRONFIT, a fast and scalable algorithm for fitting the Kronecker graph generation model to large real networks. A naive approach to fitting would take super-exponential time. In contrast, KRONFIT takes linear time, by exploiting the structure of Kronecker matrix multiplication and by using statistical simulation techniques. Experiments on a wide range of large real and synthetic networks show that KRONFIT finds accurate parameters that very well mimic the properties of target networks. In fact, using just four parameters we can accurately model several aspects of global network structure. Once fitted, the model parameters can be used to gain insights about the network structure, and the resulting synthetic graphs can be used for null-models, anonymization, extrapolations, and graph summarization.

#### Randomized Nonlinear Component Analysis

David Lopez-Paz, Suvrit Sra, Alex J. Smola, Zoubin Ghahramani, Bernhard Schölkopf, 2014. (In ICML). JMLR.org. JMLR Proceedings.

Abstract▼ URL

Classical techniques such as Principal Component Analysis (PCA) and Canonical Correlation Analysis (CCA) are ubiquitous in statistics. However, these techniques only reveal linear relationships in data. Although nonlinear variants of PCA and CCA have been proposed, they are computationally prohibitive in the large scale. In a separate strand of recent research, randomized methods have been proposed to construct features that help reveal nonlinear patterns in data. For basic tasks such as regression or classification, random features exhibit little or no loss in performance, while achieving dramatic savings in computational requirements. In this paper we leverage randomness to design scalable new variants of nonlinear PCA and CCA; our ideas also extend to key multivariate analysis tools such as spectral clustering or LDA. We demonstrate our algorithms through experiments on real-world data, on which we compare against the state-of-the-art. Code in R implementing our methods is provided in the Appendix.

#### Generalised Bayesian Matrix Factorisation Models

Shakir Mohamed, 2011. University of Cambridge, Department of Engineering, Cambridge, UK.

Abstract▼ URL

Factor analysis and related models for probabilistic matrix factorisation are of central importance to the unsupervised analysis of data, with a colourful history more than a century long. Probabilistic models for matrix factorisation allow us to explore the underlying structure in data, and have relevance in a vast number of application areas including collaborative filtering, source separation, missing data imputation, gene expression analysis, information retrieval, computational finance and computer vision, amongst others. This thesis develops generalisations of matrix factorisation models that advance our understanding and enhance the applicability of this important class of models. The generalisation of models for matrix factorisation focuses on three concerns: widening the applicability of latent variable models to the diverse types of data that are currently available; considering alternative structural forms in the underlying representations that are inferred; and including higher order data structures into the matrix factorisation framework. These three issues reflect the reality of modern data analysis and we develop new models that allow for a principled exploration and use of data in these settings. We place emphasis on Bayesian approaches to learning and the advantages that come with the Bayesian methodology. Our port of departure is a generalisation of latent variable models to members of the exponential family of distributions. This generalisation allows for the analysis of data that may be real-valued, binary, counts, non-negative or a heterogeneous set of these data types. The model unifies various existing models and constructs for unsupervised settings, the complementary framework to the generalised linear models in regression. Moving to structural considerations, we develop Bayesian methods for learning sparse latent representations. We define ideas of weakly and strongly sparse vectors and investigate the classes of prior distributions that give rise to these forms of sparsity, namely the scale-mixture of Gaussians and the spike-and-slab distribution. Based on these sparsity favouring priors, we develop and compare methods for sparse matrix factorisation and present the first comparison of these sparse learning approaches. As a second structural consideration, we develop models with the ability to generate correlated binary vectors. Moment-matching is used to allow binary data with specified correlation to be generated, based on dichotomisation of the Gaussian distribution. We then develop a novel and simple method for binary PCA based on Gaussian dichotomisation. The third generalisation considers the extension of matrix factorisation models to multi-dimensional arrays of data that are increasingly prevalent. We develop the first Bayesian model for non-negative tensor factorisation and explore the relationship between this model and the previously described models for matrix factorisation.

#### Bayesian Exponential Family PCA

Shakir Mohamed, Katherine A. Heller, Zoubin Ghahramani, December 2009. (In Advances in Neural Information Processing Systems 21). Edited by D. Koller, D. Schuurmans, Y. Bengio, L. Bottou. Cambridge, MA, USA. The MIT Press.

Abstract▼ URL

Principal Components Analysis (PCA) has become established as one of the key tools for dimensionality reduction when dealing with real valued data. Approaches such as exponential family PCA and non-negative matrix factorisation have successfully extended PCA to non-Gaussian data types, but these techniques fail to take advantage of Bayesian inference and can suffer from problems of overfitting and poor generalisation. This paper presents a fully probabilistic approach to PCA, which is generalised to the exponential family, based on Hybrid Monte Carlo sampling. We describe the model which is based on a factorisation of the observed data matrix, and show performance of the model on both synthetic and real data.

**Comment:** spotlight.

#### Bayesian and L1 Approaches for Sparse Unsupervised Learning

Shakir Mohamed, Katherine A. Heller, Zoubin Ghahramani, 2012. (In 29th International Conference on Machine Learning).

Abstract▼ URL

The use of L1 regularisation for sparse learning has generated immense research interest, with many successful applications in diverse areas such as signal acquisition, image coding, genomics and collaborative filtering. While existing work highlights the many advantages of L1 methods, in this paper we find that L1 regularisation often dramatically under-performs in terms of predictive performance when compared to other methods for inferring sparsity. We focus on unsupervised latent variable models, and develop L1 minimising factor models, Bayesian variants of “L1”, and Bayesian models with a stronger L0-like sparsity induced through spike-and-slab distributions. These spike-and-slab Bayesian factor models encourage sparsity while accounting for uncertainty in a principled manner, and avoid unnecessary shrinkage of non-zero values. We demonstrate on a number of data sets that in practice spike-and-slab Bayesian methods out-perform L1 minimisation, even on a com- putational budget. We thus highlight the need to re-assess the wide use of L1 methods in sparsity-reliant applications, particularly when we care about generalising to previously unseen data, and provide an alternative that, over many varying conditions, provides improved generalisation performance.

#### Projective limit random probabilities on Polish spaces

Peter Orbanz, 2011. (Electron. J. Stat.).

Abstract▼ URL

A pivotal problem in Bayesian nonparametrics is the construction of prior distributions on the space M(V) of probability measures on a given domain V. In principle, such distributions on the infinite-dimensional space M(V) can be constructed from their finite-dimensional marginals—the most prominent example being the construction of the Dirichlet process from finite-dimensional Dirichlet distributions. This approach is both intuitive and applicable to the construction of arbitrary distributions on M(V), but also hamstrung by a number of technical difficulties. We show how these difficulties can be resolved if the domain V is a Polish topological space, and give a representation theorem directly applicable to the construction of any probability distribution on M(V) whose first moment measure is well-defined. The proof draws on a projective limit theorem of Bochner, and on properties of set functions on Polish spaces to establish countable additivity of the resulting random probabilities.

#### Bayesian Nonparametric Models

Peter Orbanz, Yee-Whye Teh, 2010. (In Encyclopedia of Machine Learning). Springer.

#### Model-based design analysis and yield optimization

Tobias Pfingsten, Daniel Herrmann, Carl Edward Rasmussen, 2006. (IEEE Transactions on Semiconductor Manufacturing). **DOI**: 10.1109/TSM.2006.883589.

Abstract▼ URL

Fluctuations are inherent to any fabrication process. Integrated circuits and microelectromechanical systems are particularly affected by these variations, and due to high-quality requirements the effect on the devices’ perform ance has to be understood quantitatively. In recent years, it has become possible to model the performance of such complex systems on the basis of design specifications, and model-based sensitivity analysis has made its way into industrial engineering. We show how an efficient Bayesian approach, using a Gaussian process prior, can replace the commonly used brute-force Monte Carlo scheme, making it possible to apply the analysis to computationally costly models. We introduce a number of global, statistically justified sensitivity measures for design analysis and optimization. Two models of integrated systems serve us as case studies to introduce the analysis and to assess its convergence properties. We show that the Bayesian Monte Carlo scheme can save costly simulation runs and can ensure a reliable accuracy of the analysis.

**Comment:** Winner of the 2006 Best Paper Award for the journal.

#### Copula-based Kernel Dependency Measures

Barnabas Poczos, Zoubin Ghahramani, Jeff Schneider, 2012. (In 29th International Conference on Machine Learning).

Abstract▼ URL

The paper presents a new copula based method for measuring dependence between random variables. Our approach extends the Maximum Mean Discrepancy to the copula of the joint distribution. We prove that this approach has several advantageous properties. Similarly to Shannon mutual information, the proposed dependence measure is invariant to any strictly increasing transformation of the marginal variables. This is important in many applications, for example in feature selection. The estimator is consistent, robust to outliers, and uses rank statistics only. We derive upper bounds on the convergence rate and propose independence tests too. We illustrate the theoretical contributions through a series of experiments in feature selection and low-dimensional embedding of distributions.

#### Propagation of Uncertainty in Bayesian Kernel Models - Application to Multiple-Step Ahead Forecasting

Joaquin Quiñonero-Candela, Agathe Girard, Jan Larsen, Carl Edward Rasmussen, 2003. (In NNSP 2003). Edited by C. Molina, T. Adali, J. Larsen, M. Van Hulle, S. C. Douglas, J. Rouat. (Proceedings of 2003 IEEE International Workshop on Neural Networks for Signal Processing). Piscataway, New Jersey. Toulouse. IEEE Press.

Abstract▼ URL

The object of Bayesian modelling is the predictive distribution, which in a forecasting scenario enables improved estimates of forecasted values and their uncertainties. In this paper we focus on reliably estimating the predictive mean and variance of forecasted values using Bayesian kernel based models such as the Gaussian Process and the Relevance Vector Machine. We derive novel analytic expressions for the predictive mean and variance for Gaussian kernel shapes under the assumption of a Gaussian input distribution in the static case, and of a recursive Gaussian predictive density in iterative forecasting. The capability of the method is demonstrated for forecasting of time-series and compared to approximate methods.

**Comment:** Electronic version of Quiñonero-Candela, Girard, Larsen and Rasmussen, 2003 which should have been presented at ICASSP 03, but was cancelled due to bird flu epidemic.

#### Evaluating Predictive Uncertainty Challenge

Joaquin Quiñonero-Candela, Carl Edward Rasmussen, Fabian Sinz, Olivier Bousquet, Bernhard Schölkopf, 04 2006. (In Machine Learning Challenges. Evaluating predictive uncertainty, visual object classification and recognising tectual entailment. First PASCAL Machine Learning Challenges Workshop). Edited by J. Quiñonero-Candela, I. Dagan, B. Magnini, F. d'Alché-Buc. Berlin, Germany. Southampton, United Kingdom. Springer. Lecture Notes in Computer Science (LNCS). **DOI**: 10.1007/11736790_1.

Abstract▼ URL

This Chapter presents the PASCAL1 Evaluating Predictive Uncertainty Challenge, introduces the contributed Chapters by the participants who obtained outstanding results, and provides a discussion with some lessons to be learnt. The Challenge was set up to evaluate the ability of Machine Learning algorithms to provide good “probabilistic predictions”, rather than just the usual “point predictions” with no measure of uncertainty, in regression and classification problems. Participants had to compete on a number of regression and classification tasks, and were evaluated by both traditional losses that only take into account point predictions and losses we proposed that evaluate the quality of the probabilistic predictions.

#### The Inverse Regression Topic Model

Maxim Rabinovich, David M. Blei, 2014. (In 31st International Conference on Machine Learning).

Abstract▼

Recently, multinomial inverse regression (MNIR) has been proposed as a new model of annotated text based on the influence of metadata and response variables on the distribution of words in a document. While effective, MNIR has no way to exploit structure in the corpus to improve its predictions or facilitate exploratory data analysis. On the other hand, traditional probabilistic topic models (like latent Dirichlet allocation) capture natural heterogeneity in a collection but do not account for external variables. In this paper, we introduce the inverse regression topic model (IRTM), a mixed-membership extension of MNIR that combines the strengths of both methodologies. We present two inference algorithms for the IRTM: an efficient batch estimation algorithm and an online variant, which is suitable for large corpora. We apply these methods to a corpus of 73K Congressional press releases and another of 150K Yelp reviews, demonstrating that the IRTM outperforms both MNIR and supervised topic models on the prediction task. Further, we give examples showing that the IRTM enables systematic discovery of in-topic lexical variation, which is not possible with previous supervised topic models.

#### Pattern Recognition: 26th DAGM Symposium

August 2004. Edited by Carl Edward Rasmussen, Heinrich H. Bülthoff, Martin A. Giese, Bernhard Schölkopf. Berlin, Germany. Tübingen, Germany. Springer. Lecture Notes in Computer Science (LNCS). **DOI**: 10.1007/b99676.

#### The DELVE manual

Carl Edward Rasmussen, Radford M. Neal, Geoffrey E. Hinton, Drew van Camp, Mike Revow, Zoubin Ghahramani, Rafal Kustra, Robert Tibshirani, 1996.

Abstract▼ URL

DELVE – Data for Evaluating Learning in Valid Experiments – is a collection of datasets from many sources, an environment within which this data can be used to assess the performance of methods for learning relationships from data, and a repository for the results of such experiments.

**Comment:** The delve website.

#### Presynaptic and postsynaptic comptetition in models for the development of neuromuscular connections

Carl Edward Rasmussen, David J. Willshaw, 1993. (Biological Cybernetics). Springer. **DOI**: 10.1007/BF00198773.

Abstract▼ URL

In the establishment of connections between nerve and muscle there is an initial stage when each muscle fibre is innervated by several different motor axons. Withdrawal of connections then takes place until each fibre has contact from just a single axon. The evidence suggests that the withdrawal process involves competition between nerve terminals. We examine in formal models several types of competitive mechanism that have been proposed for this phenomenon. We show that a model which combines competition for a presynaptic resource with competition for a postsynaptic resource is superior to others. This model accounts for many anatomical and physiological findings and has a biologically plausible implementation. Intrinsic withdrawal appears to be a side effect of the competitive mechanism rather than a separate non-competitive feature. The model’s capabilities are confirmed by theoretical analysis and full scale computer simulations.

#### A causal perspective on domain adaptation

Mateo Rojas-Carulla, Bernhard Schölkopf, Richard Turner, Jonas Peters, 2015. (arXiv preprint arXiv:1507.05333)).

Abstract▼ URL

From training data from several related domains (or tasks), methods of domain adaptation try to combine knowledge to improve performance. This paper discusses an approach to domain adaptation which is inspired by a causal interpretation of the multi-task problem. We assume that a covariate shift assumption holds true for a subset of predictor variables: the conditional of the target variable given this subset of predictors is invariant with respect to shifts in those predictors (covariates). We propose to learn the corresponding conditional expectation in the training domains and use it for estimation in the target domain. We further introduce a method which allows for automatic inference of the above subset in regression and classification. We study the performance of this approach in an adversarial setting, in the case where no additional examples are available in the test domain. If a labeled sample is available, we provide a method for using both the transferred invariant conditional and task specific information. We present results on synthetic data sets and a sentiment analysis problem.

#### Function factorization using warped Gaussian processes

Mikkel N. Schmidt, June 2009. (In 26th International Conference on Machine Learning). Edited by Léon Bottou, Michael Littman. Montréal, QC, Canada. Omnipress.

Abstract▼ URL

We introduce a new approach to non-linear regression called function factorization, that is suitable for problems where an output variable can reasonably be modeled by a number of multiplicative interaction terms between non-linear functions of the inputs. The idea is to approximate a complicated function on a high-dimensional space by the sum of products of simpler functions on lower-dimensional subspaces. Function factorization can be seen as a generalization of matrix and tensor factorization methods, in which the data are approximated by the sum of outer products of vectors. We present a non-parametric Bayesian approach to function factorization where the priors over the factorizing functions are warped Gaussian processes, and we do inference using Hamiltonian Markov chain Monte Carlo. We demonstrate the superior predictive performance of the method on a food science data set compared to Gaussian process regression and tensor factorization using PARAFAC and GEMANOVA models.

#### Linearly constrained Bayesian matrix factorization for blind source separation

Mikkel N. Schmidt, December 2009. (In Advances in Neural Information Processing Systems 22). Edited by Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, A. Culotta. Cambridge, MA, USA. The MIT Press.

Abstract▼ URL

We present a general Bayesian approach to probabilistic matrix factorization subject to linear constraints. The approach is based on a Gaussian observation model and Gaussian priors with bilinear equality and inequality constraints. We present an efficient Markov chain Monte Carlo inference procedure based on Gibbs sampling. Special cases of the proposed model are Bayesian formulations of non-negative matrix factorization and factor analysis. The method is evaluated on a blind source separation problem. We demonstrate that our algorithm can be used to extract meaningful and interpretable features that are remarkably different from features extracted using existing related matrix factorization techniques.

**Comment:** code.

#### Probabilistic non-negative tensor factorization using Markov chain Monte Carlo

Mikkel N. Schmidt, Shakir Mohamed, August 2009. (In European Signal Processing Conference (EUSIPCO)). Glasgow, Scotland.

Abstract▼ URL

We present a probabilistic model for learning non-negative tensor factorizations (NTF), in which the tensor factors are latent variables associated with each data dimension. The non-negativity constraint for the latent factors is handled by choosing priors with support on the non-negative numbers. Two Bayesian inference procedures based on Markov chain Monte Carlo sampling are described: Gibbs sampling and Hamiltonian Markov chain Monte Carlo. We evaluate the model on two food science data sets, and show that the probabilistic NTF model leads to better predictions and avoids overfitting compared to existing NTF approaches.

**Comment:** Rated by reviewers amongst the top 5% of the presented papers.

#### Bayesian non-negative matrix factorization

Mikkel N. Schmidt, Ole Winther, Lars Kai Hansen, March 2009. (In 8th International Conference on Independent Component Analysis and Signal Separation). Paraty, Brazil. Springer. Lecture Notes in Computer Science (LNCS).

Abstract▼ URL

We present a Bayesian treatment of non-negative matrix factorization (NMF), based on a normal likelihood and exponential priors, and derive an efficient Gibbs sampler to approximate the posterior density of the NMF factors. On a chemical brain imaging data set, we show that this improves interpretability by providing uncertainty estimates. We discuss how the Gibbs sampler can be used for model order selection by estimating the marginal likelihood, and compare with the Bayesian information criterion. For computing the maximum a posteriori estimate we present an iterated conditional modes algorithm that rivals existing state-of-the-art NMF algorithms on an image feature extraction problem.

#### Formally justified and modular Bayesian inference for probabilistic programs

Adam Ścibior, 2019. University of Cambridge, Department of Engineering, Cambridge, UK.

Abstract▼ URL

Probabilistic modelling offers a simple and coherent framework to describe the real world in the face of uncertainty. Furthermore, by applying Bayes’ rule it is possible to use probabilistic models to make inferences about the state of the world from partial observations. While traditionally probabilistic models were constructed on paper, more recently the approach of probabilistic programming enables users to write the models in executable languages resembling computer programs and to freely mix them with deterministic code. It has long been recognised that the semantics of programming languages is complicated and the intuitive understanding that programmers have is often inaccurate, resulting in difficult to understand bugs and unexpected program behaviours. Programming languages are therefore studied in a rigorous way using formal languages with mathematically defined semantics. Traditionally formal semantics of probabilistic programs are defined using exact inference results, but in practice exact Bayesian inference is not tractable and approximate methods are used instead, posing a question of how the results of these algorithms relate to the exact results. Correctness of such approximate methods is usually argued somewhat less rigorously, without reference to a formal semantics. In this dissertation we formally develop denotational semantics for probabilistic programs that correspond to popular sampling algorithms often used in practice. The semantics is defined for an expressive typed lambda calculus with higher-order functions and inductive types, extended with probabilistic effects for sampling and conditioning, allowing continuous distributions and unbounded likelihoods. It makes crucial use of the recently developed formalism of quasi-Borel spaces to bring all these elements together. We provide semantics corresponding to several variants of Markov chain Monte Carlo and Sequential Monte Carlo methods and formally prove a notion of correctness for these algorithms in the context of probabilistic programming. We also show that the semantic construction can be directly mapped to an implementation using established functional programming abstractions called monad transformers. We develop a compact Haskell library for probabilistic programming closely corresponding to the semantic construction, giving users a high level of assurance in the correctness of the implementation. We also demonstrate on a collection of benchmarks that the library offers performance competitive with existing systems of similar scope. An important property of our construction, both the semantics and the implementation, is the high degree of modularity it offers. All the inference algorithms are constructed by combining small building blocks in a setup where the type system ensures correctness of compositions. We show that with basic building blocks corresponding to vanilla Metropolis-Hastings and Sequential Monte Carlo we can implement more advanced algorithms known in the literature, such as Resample-Move Sequential Monte Carlo, Particle Marginal Metropolis-Hastings, and Sequential Monte Carlo squared. These implementations are very concise, reducing the effort required to produce them and the scope for bugs. On top of that, our modular construction enables in some cases deterministic testing of randomised inference algorithms, further increasing reliability of the implementation.

#### Practical Probabilistic Programming with Monads

Adam Ścibior, Zoubin Ghahramani, Andrew D. Gordon, 2015. (In Proceedings of the 8th ACM SIGPLAN Symposium on Haskell). Association for Computing Machinery. **DOI**: 10.1145/2804302.2804317.

Abstract▼ URL

The machine learning community has recently shown a lot of interest in practical probabilistic programming systems that target the problem of Bayesian inference. Such systems come in different forms, but they all express probabilistic models as computational processes using syntax resembling programming languages. In the functional programming community monads are known to offer a convenient and elegant abstraction for programming with probability distributions, but their use is often limited to very simple inference problems. We show that it is possible to use the monad abstraction to construct probabilistic models for machine learning, while still offering good performance of inference in challenging models. We use a GADT as an underlying representation of a probability distribution and apply Sequential Monte Carlo-based methods to achieve efficient inference. We define a formal semantics via measure theory. We demonstrate a clean and elegant implementation that achieves performance comparable with Anglican, a state-of-the-art probabilistic programming system.

#### Functional programming for modular Bayesian inference

Adam Ścibior, Ohad Kammar, Zoubin Ghahramani, 2018. (Proceedings of the ACM on Programming Languages).

Abstract▼ URL

We present an architectural design of a library for Bayesian modelling and inference in modern functional programming languages. The novel aspect of our approach are modular implementations of existing state-of-the-art inference algorithms. Our design relies on three inherently functional features: higher-order functions, inductive data-types, and support for either type-classes or an expressive module system. We provide a performant Haskell implementation of this architecture, demonstrating that high-level and modular probabilistic programming can be added as a library in sufficiently expressive languages. We review the core abstractions in this architecture: inference representations, inference transformations, and inference representation transformers. We then implement concrete instances of these abstractions, counterparts to particle filters and Metropolis-Hastings samplers, which form the basic building blocks of our library. By composing these building blocks we obtain state-of-the-art inference algorithms: Resample-Move Sequential Monte Carlo, Particle Marginal Metropolis-Hastings, and Sequential Monte Carlo Squared. We evaluate our implementation against existing probabilistic programming systems and find it is already competitively performant, although we conjecture that existing functional programming optimisation techniques could reduce the overhead associated with the abstractions we use. We show that our modular design enables deterministic testing of inherently stochastic Monte Carlo algorithms. Finally, we demonstrate using OCaml that an expressive module system can also implement our design.

#### Denotational Validation of Higher-Order Bayesian Inference

Adam Ścibior, Ohad Kammar, Matthijs Vákár, Sam Staton, Hongseok Yang, Yufei Cai, Klaus Ostermann, Sean K. Moss, Chris Heunen, Zoubin Ghahramani, 2018. (Proceedings of the ACM on Programming Languages).

Abstract▼ URL

We present a modular semantic account of Bayesian inference algorithms for probabilistic programming languages, as used in data science and machine learning. Sophisticated inference algorithms are often explained in terms of composition of smaller parts. However, neither their theoretical justification nor their implementation reflects this modularity. We show how to conceptualise and analyse such inference algorithms as manipulating intermediate representations of probabilistic programs using higher-order functions and inductive types, and their denotational semantics. Semantic accounts of continuous distributions use measurable spaces. However, our use of higher-order functions presents a substantial technical difficulty: it is impossible to define a measurable space structure over the collection of measurable functions between arbitrary measurable spaces that is compatible with standard operations on those functions, such as function application. We overcome this difficulty using quasi-Borel spaces, a recently proposed mathematical structure that supports both function spaces and continuous distributions. We define a class of semantic structures for representing probabilistic programs, and semantic validity criteria for transformations of these representations in terms of distribution preservation. We develop a collection of building blocks for composing representations. We use these building blocks to validate common inference algorithms such as Sequential Monte Carlo and Markov Chain Monte Carlo. To emphasize the connection between the semantic manipulation and its traditional measure theoretic origins, we use Kock’s synthetic measure theory. We demonstrate its usefulness by proving a quasi-Borel counterpart to the Metropolis-Hastings-Green theorem.

#### Metadata Archaeology: Unearthing Data Subsets by Leveraging Training Dynamics

Shoaib Ahmed Siddiqui, Nitarshan Rajkumar, Tegan Maharaj, David Krueger, Sara Hooker, 2022. (arXiv preprint arXiv:2209.10015).

Abstract▼ URL

Modern machine learning research relies on relatively few carefully curated datasets. Even in these datasets, and typically in `untidy’ or raw data, practitioners are faced with significant issues of data quality and diversity which can be prohibitively labor intensive to address. Existing methods for dealing with these challenges tend to make strong assumptions about the particular issues at play, and often require a priori knowledge or metadata such as domain labels. Our work is orthogonal to these methods: we instead focus on providing a unified and efficient framework for Metadata Archaeology – uncovering and inferring metadata of examples in a dataset. We curate different subsets of data that might exist in a dataset (e.g. mislabeled, atypical, or out-of-distribution examples) using simple transformations, and leverage differences in learning dynamics between these probe suites to infer metadata of interest. Our method is on par with far more sophisticated mitigation methods across different tasks: identifying and correcting mislabeled examples, classifying minority-group samples, prioritizing points relevant for training and enabling scalable human auditing of relevant examples.

**Comment:** Project webpage: https://metadata-archaeology.github.io/

#### Defining and Characterizing Reward Hacking

Joar Skalse, Nikolaus HR Howe, Dmitrii Krasheninnikov, David Krueger, 2022. (In Advances in Neural Information Processing Systems 35).

#### Compressing Sets and Multisets of Sequences

Christian Steinruecken, March 2015. (IEEE Transactions on Information Theory). IEEE. **DOI**: 10.1109/TIT.2015.2392093. **ISSN**: 0018-9448. **Note**: A previous version was published at the Data Compression Conference 2014..

Abstract▼ URL

This article describes lossless compression algorithms for multisets of sequences, taking advantage of the multiset’s unordered structure. Multisets are a generalisation of sets where members are allowed to occur multiple times. A multiset can be encoded naively by simply storing its elements in some sequential order, but then information is wasted on the ordering. We propose a technique that transforms the multiset into an order-invariant tree representation, and derive an arithmetic code that optimally compresses the tree. Our method achieves compression even if the sequences in the multiset are individually incompressible (such as cryptographic hash sums). The algorithm is demonstrated practically by compressing collections of SHA-1 hash sums, and multisets of arbitrary, individually encodable objects.

#### Improving PPM with dynamic parameter updates

Christian Steinruecken, Zoubin Ghahramani, David MacKay, April 2015. (In Proceedings of the Data Compression Conference). Edited by Ali Bilgin, Michael W. Marcellin, Joan Serra-Sagristà, James A. Storer. Snowbird, UT, USA. IEEE Computer Society. **ISSN**: 1068-0314.

Abstract▼ URL

This article makes several improvements to the classic PPM algorithm, resulting in a new algorithm with superior compression effectiveness on human text. The key differences of our algorithm to classic PPM are that (A) rather than the original escape mechanism, we use a generalised blending method with explicit hyper-parameters that control the way symbol counts are combined to form predictions; (B) different hyper-parameters are used for classes of different contexts; and (C) these hyper-parameters are updated dynamically using gradient information. The resulting algorithm (PPM-DP) compresses human text better than all currently published variants of PPM, CTW, DMC, LZ, CSE and BWT, with runtime only slightly slower than classic PPM.

#### Compressing combinatorial objects

Christian Steinruecken, January 2016.

Abstract▼ URL

Most of the world’s digital data is currently encoded in a sequential form, and compression methods for sequences have been studied extensively. However, there are many types of non-sequential data for which good compression techniques are still largely unexplored. This paper contributes insights and concrete techniques for compressing various kinds of non-sequential data via arithmetic coding, and derives re-usable probabilistic data models from fairly generic structural assumptions. Near-optimal compression methods are described for certain types of permutations, combinations and multisets; and the conditions for optimality are made explicit for each method.

#### U-Likelihood and U-Updating Algorithms: Statistical Inference in Latent Variable Models

JaeMo Sung, Sung Yang Bang, Seungjin Choi, Zoubin Ghahramani, 2005. (In ECML). Edited by João Gama, Rui Camacho, Pavel Brazdil, Alípio Jorge, Luís Torgo. Springer. Lecture Notes in Computer Science. **ISBN**: 3-540-29243-8.

Abstract▼ URL

In this paper we consider latent variable models and introduce a new U-likelihood concept for estimating the distribution over hidden variables. One can derive an estimate of parameters from this distribution. Our approach differs from the Bayesian and Maximum Likelihood (ML) approaches. It gives an alternative to Bayesian inference when we don’t want to define a prior over parameters and gives an alternative to the ML method when we want a better estimate of the distribution over hidden variables. As a practical implementation, we present a U-updating algorithm based on the mean field theory to approximate the distribution over hidden variables from the U-likelihood. This algorithm captures some of the correlations among hidden variables by estimating reaction terms. Those reaction terms are found to penalize the likelihood. We show that the U-updating algorithm becomes the EM algorithm as a special case in the large sample limit. The useful behavior of our method is confirmed for the case of mixture of Gaussians by comparing to the EM algorithm.

#### Backward-Compatible Prediction Updates: A Probabilistic Approach

F. Träuble, J. von Kügelgen, M. Kleindessner, F. Locatello, B. Schölkopf, P. Gehler, 2021. (In Advances in Neural Information Processing Systems 34). Edited by M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, J. Wortman Vaughan. Curran Associates, Inc..

Abstract▼ URL

When machine learning systems meet real world applications, accuracy is only one of several requirements. In this paper, we assay a complementary perspective originating from the increasing availability of pre-trained and regularly improving state-of-the-art models. While new improved models develop at a fast pace, downstream tasks vary more slowly or stay constant. Assume that we have a large unlabelled data set for which we want to maintain accurate predictions. Whenever a new and presumably better ML models becomes available, we encounter two problems: (i) given a limited budget, which data points should be re-evaluated using the new model?; and (ii) if the new predictions differ from the current ones, should we update? Problem (i) is about compute cost, which matters for very large data sets and models. Problem (ii) is about maintaining consistency of the predictions, which can be highly relevant for downstream applications; our demand is to avoid negative flips, i.e., changing correct to incorrect predictions. In this paper, we formalize the Prediction Update Problem and present an efficient probabilistic approach as answer to the above questions. In extensive experiments on standard classification benchmark data sets, we show that our method outperforms alternative strategies along key metrics for backward-compatible prediction updates.

#### Bayesian model search for mixture models based on optimizing variational bounds

Naonori Ueda, Zoubin Ghahramani, 2002. (Neural Networks).

Abstract▼ URL

When learning a mixture model, we suffer from the local optima and model structure determination problems. In this paper, we present a method for simultaneously solving these problems based on the variational Bayesian (VB) framework. First, in the VB framework, we derive an objective function that can simultaneously optimize both model parameter distributions and model structure. Next, focusing on mixture models, we present a deterministic algorithm to approximately optimize the objective function by using the idea of the split and merge operations which we previously proposed within the maximum likelihood framework. Then, we apply the method to mixture of expers (MoE) models to experimentally show that the proposed method can find the optimal number of experts of a MoE while avoiding local maxima. q 2002 Elsevier Science Ltd. All rights reserved.

#### Correlation clustering for crosslingual link detection

J. Van Gael, X. Zhu, January 2007. (In International Joint Conference on Artificial Intelligence). Edited by Manuela M. Veloso. Hyderabad, India.

Abstract▼ URL

The crosslingual link detection problem calls for identifying news articles in multiple languages that report on the same news event. This paper presents a novel approach based on constrained clustering. We discuss a general way for constrained clustering using a recent, graph-based clustering framework called correlation clustering. We introduce a correlation clustering implementation that features linear program chunking to allow processing larger datasets. We show how to apply the correlation clustering algorithm to the crosslingual link detection problem and present experimental results that show correlation clustering improves upon the hierarchical clustering approaches commonly used in link detection, and, hierarchical clustering approaches that take constraints into account.

#### Dirichlet process mixture models for verb clustering

Andreas Vlachos, Zoubin Ghahramani, Anna Korhonen, 2008. (In Proceedings of the ICML workshop on Prior Knowledge for Text and Language).

Abstract▼

In this work we apply Dirichlet Process Mixture Models to a learning task in natural language processing (NLP): lexical-semantic verb clustering. We assess the performance on a dataset based on Levin’s (1993) verb classes using the recently introduced V- measure metric. In, we present a method to add human supervision to the model in order to to influence the solution with respect to some prior knowledge. The quantitative evaluation performed highlights the benefits of the chosen method compared to previously used clustering approaches.

#### Probabilistic Models for Data Combination in Recommender Systems

Sinead Williamson, Zoubin Ghahramani, 2008. (In Learning from Multiple Sources Workshop, NIPS Conference). Whistler Canada.

#### Provable lifelong learning of representations

Xinyuan Cao, Weiyang Liu, Santosh Vempala, 2022. (In International Conference on Artificial Intelligence and Statistics).

#### Iterative teaching by label synthesis

Weiyang Liu, Zhen Liu, Hanchen Wang, Liam Paull, Bernhard Schölkopf, Adrian Weller, 2021. (Advances in Neural Information Processing Systems).

#### Learning with hyperspherical uniformity

Weiyang Liu, Rongmei Lin, Zhen Liu, Li Xiong, Bernhard Schölkopf, Adrian Weller, 2021. (In International Conference On Artificial Intelligence and Statistics).

#### Orthogonal over-parameterized training

Weiyang Liu, Rongmei Lin, Zhen Liu, James M Rehg, Liam Paull, Li Xiong, Le Song, Adrian Weller, 2021. (In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition).

#### Pre-training Molecular Graph Representation with 3D Geometry

Shengchao Liu, Hanchen Wang, Weiyang Liu, Joan Lasenby, Hongyu Guo, Jian Tang, 2022. (In International Conference on Learning Representations).

#### SphereFace Revived: Unifying Hyperspherical Face Recognition

Weiyang Liu, Yandong Wen, Bhiksha Raj, Rita Singh, Adrian Weller, 2022. (IEEE Transactions on Pattern Analysis and Machine Intelligence). IEEE.

#### Structural Causal 3D Reconstruction

Weiyang Liu, Zhen Liu, Liam Paull, Adrian Weller, Bernhard Schölkopf, 2022. (In European Conference on Computer Vision).

#### Self-supervised 3d face reconstruction via conditional estimation

Yandong Wen, Weiyang Liu, Bhiksha Raj, Rita Singh, 2021. (In Proceedings of the IEEE/CVF International Conference on Computer Vision).

#### SphereFace2: Binary Classification is All You Need for Deep Face Recognition

Yandong Wen, Weiyang Liu, Adrian Weller, Bhiksha Raj, Rita Singh, 2022. (In International Conference on Learning Representations).

#### Locality sensitive teaching

Zhaozhuo Xu, Beidi Chen, Chaojian Li, Weiyang Liu, Le Song, Yingyan Lin, Anshumali Shrivastava, 2021. (Advances in Neural Information Processing Systems).

#### Towards principled disentanglement for domain generalization

Hanlin Zhang, Yi-Fan Zhang, Weiyang Liu, Adrian Weller, Bernhard Schölkopf, Eric P Xing, 2022. (In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition).