## Publications

#### Converting to Optimization in Machine Learning: Perturb-and-MAP, Differential Privacy, and Program Synthesis

Matej Balog, 2020. University of Cambridge, Department of Engineering, Cambridge, UK.

Abstract▼ URL

On a mathematical level, most computational problems encountered in machine learning are instances of one of four abstract, fundamental problems: sampling, integration, optimization, and search. Thanks to the rich history of the respective mathematical fields, disparate methods with different properties have been developed for these four problem classes. As a result it can be beneficial to convert a problem from one abstract class into a problem of a different class, because the latter might come with insights, techniques, and algorithms well suited to the particular problem at hand. In particular, this thesis contributes four new methods and generalizations of existing methods for converting specific non-optimization machine learning tasks into optimization problems with more appealing properties. The first example is partition function estimation (an integration problem), where an existing algorithm – the Gumbel trick – for converting to the MAP optimization problem is generalized into a more general family of algorithms, such that other instances of this family have better statistical properties. Second, this family of algorithms is further generalized to another integration problem, the problem of estimating Rényi entropies. The third example shows how an intractable sampling problem arising when wishing to publicly release a database containing sensitive data in a safe (“differentially private”) manner can be converted into an optimization problem using the theory of Reproducing Kernel Hilbert Spaces. Finally, the fourth case study casts the challenging discrete search problem of program synthesis from input-output examples as a supervised learning task that can be efficiently tackled using gradient-based optimization. In all four instances, the conversions result in novel algorithms with desirable properties. In the first instance, new generalizations of the Gumbel trick can be used to construct statistical estimators of the partition function that achieve the same estimation error while using up to 40% fewer samples. The second instance shows that unbiased estimators of the Rényi entropy can be constructed in the Perturb-and-MAP framework. The main contribution of the third instance is theoretical: the conversion shows that it is possible to construct an algorithm for releasing synthetic databases that approximate databases containing sensitive data in a mathematically precise sense, and to prove results about their approximation errors. Finally, the fourth conversion yields an algorithm for synthesising program source code from input-output examples that is able to solve test problems 1-3 orders of magnitude faster than a wide range of baselines.

#### DeepCoder: Learning to Write Programs

Matej Balog, Alexander L. Gaunt, Marc Brockschmidt, Sebastian Nowozin, Daniel Tarlow, April 2017. (In 5th International Conference on Learning Representations). Toulon, France.

Abstract▼ URL

We develop a first line of attack for solving programming competition-style problems from input-output examples using deep learning. The approach is to train a neural network to predict properties of the program that generated the outputs from the inputs. We use the neural network’s predictions to augment search techniques from the programming languages community, including enumerative search and an SMT-based solver. Empirically, we show that our approach leads to an order of magnitude speedup over the strong non-augmented baselines and a Recurrent Neural Network approach, and that we are able to solve problems of difficulty comparable to the simplest problems on programming competition websites.

#### The Mondrian Kernel

Matej Balog, Balaji Lakshminarayanan, Zoubin Ghahramani, Daniel M. Roy, Yee Whye Teh, June 2016. (In 32nd Conference on Uncertainty in Artificial Intelligence). Jersey City, New Jersey, USA.

Abstract▼ URL

We introduce the Mondrian kernel, a fast random feature approximation to the Laplace kernel. It is suitable for both batch and online learning, and admits a fast kernel-width-selection procedure as the random features can be re-used efficiently for all kernel widths. The features are constructed by sampling trees via a Mondrian process [Roy and Teh, 2009], and we highlight the connection to Mondrian forests [Lakshminarayanan et al., 2014], where trees are also sampled via a Mondrian process, but fit independently. This link provides a new insight into the relationship between kernel methods and random forests.

**Comment:** [Supplementary Material] [arXiv] [Poster] [Slides] [Code]

#### Fast training of sparse graph neural networks on dense hardware

Matej Balog, Bart van Merriënboer, Subhodeep Moitra, Yujia Li, Daniel Tarlow, 2019. (arXiv).

Abstract▼ URL

Graph neural networks have become increasingly popular in recent years due to their ability to naturally encode relational input data and their ability to scale to large graphs by operating on a sparse representation of graph adjacency matrices. As we look to scale up these models using custom hardware, a natural assumption would be that we need hardware tailored to sparse operations and/or dynamic control flow. In this work, we question this assumption by scaling up sparse graph neural networks using a platform targeted at dense computation on fixed-size data. Drawing inspiration from optimization of numerical algorithms on sparse matrices, we develop techniques that enable training the sparse graph neural network model from Allamanis et al. [2018] in 13 minutes using a 512-core TPUv2 Pod, whereas the original training takes almost a day.

#### Neural program synthesis with a differentiable fixer

Matej Balog, Rishabh Singh, Petros Maniatis, Charles Sutton, 2020. (arXiv).

Abstract▼ URL

We present a new program synthesis approach that combines an encoder-decoder based synthesis architecture with a differentiable program fixer. Our approach is inspired from the fact that human developers seldom get their program correct on the first attempt, and perform iterative testing-based program fixing to get to the desired program functionality. Similarly, our approach first learns a distribution over programs conditioned on an encoding of a set of input-output examples, and then iteratively performs fix operations using the differentiable fixer. The fixer takes as input the original examples and the current program’s outputs on example inputs, and generates a new distribution over the programs with the goal of reducing the discrepancies between the current program outputs and the desired example outputs. We train our architecture end-to-end on the RobustFill domain, and show that the addition of the fixer module leads to a significant improvement on synthesis accuracy compared to using beam search.

#### Differentially Private Database Release via Kernel Mean Embeddings

Matej Balog, Ilya Tolstikhin, Bernhard Schölkopf, July 2018. (In 35th International Conference on Machine Learning). Stockholm, Sweden.

Abstract▼ URL

We lay theoretical foundations for new database release mechanisms that allow third-parties to construct consistent estimators of population statistics, while ensuring that the privacy of each individual contributing to the database is protected. The proposed framework rests on two main ideas. First, releasing (an estimate of) the kernel mean embedding of the data generating random variable instead of the database itself still allows third-parties to construct consistent estimators of a wide class of population statistics. Second, the algorithm can satisfy the definition of differential privacy by basing the released kernel mean embedding on entirely synthetic data points, while controlling accuracy through the metric available in a Reproducing Kernel Hilbert Space. We describe two instantiations of the proposed framework, suitable under different scenarios, and prove theoretical results guaranteeing differential privacy of the resulting algorithms and the consistency of estimators constructed from their outputs.

**Comment:** [arXiv]

#### Lost Relatives of the Gumbel Trick

Matej Balog, Nilesh Tripuraneni, Zoubin Ghahramani, Adrian Weller, August 2017. (In 34th International Conference on Machine Learning). Sydney, Australia.

Abstract▼ URL

The Gumbel trick is a method to sample from a discrete probability distribution, or to estimate its normalizing partition function. The method relies on repeatedly applying a random perturbation to the distribution in a particular way, each time solving for the most likely configuration. We derive an entire family of related methods, of which the Gumbel trick is one member, and show that the new methods have superior properties in several settings with minimal additional computational cost. In particular, for the Gumbel trick to yield computational benefits for discrete graphical models, Gumbel perturbations on all configurations are typically replaced with so-called low-rank perturbations. We show how a subfamily of our new methods adapts to this setting, proving new upper and lower bounds on the log partition function and deriving a family of sequential samplers for the Gibbs distribution. Finally, we balance the discussion by showing how the simpler analytical form of the Gumbel trick enables additional theoretical results.