# Network Modelling

Techniques for analyzing and modeling complex networks, such as social networks, transportation systems, and communication networks.

#### Concrete dropout

Yarin Gal, Jiri Hron, Alex Kendall, 2017. (NeurIPS).

Abstract▼ URL

Dropout is used as a practical tool to obtain uncertainty estimates in large vision models and reinforcement learning (RL) tasks. But to obtain well-calibrated uncertainty estimates, a grid-search over the dropout probabilities is necessary—a prohibitive operation with large models, and an impossible one with RL. We propose a new dropout variant which gives improved performance and better calibrated uncertainties. Relying on recent developments in Bayesian deep learning, we use a continuous relaxation of dropout’s discrete masks. Together with a principled optimisation objective, this allows for automatic tuning of the dropout probability in large models, and as a result faster experimentation cycles. In RL this allows the agent to adapt its uncertainty dynamically as more data is observed. We analyse the proposed variant extensively on a range of tasks, and give insights into common practice in the field where larger dropout probabilities are often used in deeper model layers.

#### Exact posteriors of wide Bayesian neural networks

Jiri Hron, Yasaman Bahri, Roman Novak, Jeffrey Pennington, Jascha Sohl-Dickstein, 2020. (UDL (ICML workshop)).

Abstract▼ URL

Recent work has shown that the prior over functions induced by a deep Bayesian neural network (BNN) behaves as a Gaussian process (GP) as the width of all layers becomes large. However, many BNN applications are concerned with the BNN function space posterior. While some empirical evidence of the posterior convergence was provided in the original works of Neal (1996) and Matthews et al. (2018), it is limited to small datasets or architectures due to the notorious difficulty of obtaining and verifying exactness of BNN posterior approximations. We provide the missing theoretical proof that the exact BNN posterior converges (weakly) to the one induced by the GP limit of the prior. For empirical validation, we show how to generate exact samples from a finite BNN on a small dataset via rejection sampling.

#### Infinite attention: NNGP and NTK for deep attention networks

Jiri Hron, Yasaman Bahri, Jascha Sohl-Dickstein, Roman Novak, 2020. (ICML).

Abstract▼ URL

There is a growing amount of literature on the relationship between wide neural networks (NNs) and Gaussian processes (GPs), identifying an equivalence between the two for a variety of NN architectures. This equivalence enables, for instance, accurate approximation of the behaviour of wide Bayesian NNs without MCMC or variational approximations, or characterisation of the distribution of randomly initialised wide NNs optimised by gradient descent without ever running an optimiser. We provide a rigorous extension of these results to NNs involving attention layers, showing that unlike single-head attention, which induces non-Gaussian behaviour, multi-head attention architectures behave as GPs as the number of heads tends to infinity. We further discuss the effects of positional encodings and layer normalisation, and propose modifications of the attention mechanism which lead to improved results for both finite and infinitely wide NNs. We evaluate attention kernels empirically, leading to a moderate improvement upon the previous state-of-the-art on CIFAR-10 for GPs without trainable kernels and advanced data preprocessing. Finally, we introduce new features to the Neural Tangents library (Novak et al., 2020) allowing applications of NNGP/NTK models, with and without attention, to variable-length sequences, with an example on the IMDb reviews dataset.

#### Variational Bayesian dropout: pitfalls and fixes

Jiri Hron, Alexander G. D. G. Matthews, Zoubin Ghahramani, 2018. (ICML).

Abstract▼ URL

Dropout, a stochastic regularisation technique for training of neural networks, has recently been reinterpreted as a specific type of approximate inference algorithm for Bayesian neural networks. The main contribution of the reinterpretation is in providing a theoretical framework useful for analysing and extending the algorithm. We show that the proposed framework suffers from several issues; from undefined or pathological behaviour of the true posterior related to use of improper priors, to an ill-defined variational objective due to singularity of the approximating distribution relative to the true posterior. Our analysis of the improper log uniform prior used in variational Gaussian dropout suggests the pathologies are generally irredeemable, and that the algorithm still works only because the variational formulation annuls some of the pathologies. To address the singularity issue, we proffer Quasi-KL (QKL) divergence, a new approximate inference objective for approximation of high-dimensional distributions. We show that motivations for variational Bernoulli dropout based on discretisation and noise have QKL as a limit. Properties of QKL are studied both theoretically and on a simple practical example which shows that the QKL-optimal approximation of a full rank Gaussian with a degenerate one naturally leads to the Principal Component Analysis solution.

#### Bayesian neural networks have a simple weight posterior: theory and accelerated sampling

Jiri Hron, Roman Novak, Jeffrey Pennington, Jascha Sohl-Dickstein, 2022. (ICML).

Abstract▼ URL

We introduce repriorisation, a data-dependent reparameterisation which transforms a Bayesian neural network (BNN) posterior to a distribution whose KL divergence to the BNN prior vanishes as layer widths grow. The repriorisation map acts directly on parameters, and its analytic simplicity complements the known neural network Gaussian process (NNGP) behaviour of wide BNNs in function space. Exploiting the repriorisation, we develop a Markov chain Monte Carlo (MCMC) posterior sampling algorithm which mixes faster the wider the BNN. This contrasts with the typically poor performance of MCMC in high dimensions. We observe up to 50x higher effective sample size relative to no reparametrisation for both fully-connected and residual networks. Improvements are achieved at all widths, with the margin between reparametrised and standard BNNs growing with layer width.

#### Unsupervised Many-to-Many Object Matching for Relational Data

Tomoharu Iwata, James Robert Lloyd, Zoubin Ghahramani, 2015. (IEEE Transactions on Pattern Analysis and Machine Intelligence).

Abstract▼ URL

We propose a method for unsupervised many-to-many object matching from multiple networks, which is the task of finding correspondences between groups of nodes in different networks. For example, the proposed method can discover shared word groups from multi-lingual document-word networks without cross-language alignment information. We assume that multiple networks share groups, and each group has its own interaction pattern with other groups. Using infinite relational models with this assumption, objects in different networks are clustered into common groups depending on their interaction patterns, discovering a matching. The effectiveness of the proposed method is experimentally demonstrated by using synthetic and real relational data sets, which include applications to cross-domain recommendation without shared user/item identifiers and multi-lingual word clustering.

#### Complex interlinkages, key objectives and nexuses amongst the Sustainable Development Goals and climate change: a network analysis

F. Laumann, J. von Kügelgen, T. H. Kanashiro Uehara, M. Barahona, 2022. (The Lancet Planetary Health). **DOI**: 10.1016/S2542-5196(22)00070-5.

Abstract▼ URL

Background: Global sustainability is an enmeshed system of complex socioeconomic, climatological, and ecological interactions. The numerous objectives of the UN’s Sustainable Development Goals (SDGs) and the Paris Agreement have various levels of interdependence, making it difficult to ascertain the influence of changes to particular indicators across the whole system. In this analysis, we aimed to detect and rank the complex interlinkages between objectives of sustainability agendas. Methods: We developed a method to find interlinkages among the 17 SDGs and climate change, including non-linear and non-monotonic dependences. We used time series of indicators defined by the World Bank, consisting of 400 indicators that measure progress towards the 17 SDGs and an 18th variable (annual average temperatures), representing progress in the response to the climate crisis, from 2000 to 2019. This method detects significant dependencies among the time evolution of the objectives by using partial distance correlations, a non-linear measure of conditional dependence that also discounts spurious correlations originating from lurking variables. We then used a network representation to identify the most important objectives (using network centrality) and to obtain nexuses of objectives (defined as highly interconnected clusters in the network). Findings: Using temporal data from 181 countries spanning 20 years, we analysed dependencies among SDGs and climate for 35 country groupings based on region, development, and income level. The observed significant interlinkages, central objectives, and nexuses identified varied greatly across country groupings; however, SDG 17 (partnerships for the goals) and climate change ranked as highly important across many country groupings. Temperature rise was strongly linked to urbanisation, air pollution, and slum expansion (SDG 11), especially in country groupings likely to be worst affected by climate breakdown, such as Africa. In several country groupings composed of developing nations, we observed a consistent nexus of strongly interconnected objectives formed by SDG 1 (poverty reduction), SDG 4 (education), and SDG 8 (economic growth), sometimes incorporating SDG 5 (gender equality), and SDG 16 (peace and justice). Interpretation: The differences across groupings emphasise the need to define goals in accordance with local circumstances and priorities. Our analysis highlights global partnerships (SDG 17) as a pivot in global sustainability efforts, which have been strongly linked to economic growth (SDG 8). However, if economic growth and trade expansion were repositioned as a means instead of an end goal of development, our analysis showed that education (SDG 4) and poverty reduction (SDG 1) become more central, thus suggesting that these could be prioritised in global partnerships. Urban livelihoods (SDG 11) were also flagged as important to avoid replicating unsustainable patterns of the past.

#### Representation, learning, description and criticism of probabilistic models with applications to networks, functions and relational data

James Rovert Lloyd, 2015. University of Cambridge, Department of Engineering, Cambridge, UK.

Abstract▼ URL

This thesis makes contributions to a variety of aspects of probabilistic inference. When performing probabilistic inference, one must first represent one’s beliefs with a probability distribution. Specifying the details of a probability distribution can be a difficult task in many situations, but when expressing beliefs about complex data structures it may not even be apparent what form such a distribution should take. This thesis starts by demonstrating how representation theorems due to Aldous, Hoover and Kallenberg can be used to specify appropriate models for data in the form of networks. These theorems are then extended in order to reveal appropriate probability distributions for arbitrary relational data or databases. A simpler data structure to specify probability distributions for is that of functions; many probability distributions for functions have been used for centuries. We demonstrate that many of these distributions can be expressed in a common language of Gaussian process kernels constructed from a few base elements and operators. The structure of this language allows for the effective automatic construction of probabilistic models for functions. Furthermore, the formal mathematical language of kernels can be mapped neatly onto natural language allowing for automatic descriptions of the automatically constructed models. By further automating the construction of statistical models, the need to be able to effectively check or criticise these models becomes greater. This thesis demonstrates how kernel two sample tests can be used to demonstrate where a probabilistic model most disagrees with data allowing for targeted improvements to the model. In proposing a new method of model criticism this thesis also briefly discusses the philosophy of model criticism within the context of probabilistic inference.

#### Random function priors for exchangeable arrays with applications to graphs and relational data

James Robert Lloyd, Peter Orbanz, Zoubin Ghahramani, Daniel M. Roy, December 2012. (In Advances in Neural Information Processing Systems 26). Lake Tahoe, California, USA.

Abstract▼ URL

A fundamental problem in the analysis of structured relational data like graphs, networks, databases, and matrices is to extract a summary of the common structure underlying relations between individual entities. Relational data are typically encoded in the form of arrays; invariance to the ordering of rows and columns corresponds to exchangeable arrays. Results in probability theory due to Aldous, Hoover and Kallenberg show that exchangeable arrays can be represented in terms of a random measurable function which constitutes the natural model parameter in a Bayesian model. We obtain a flexible yet simple Bayesian nonparametric model by placing a Gaussian process prior on the parameter function. Efficient inference utilises elliptical slice sampling combined with a random sparse approximation to the Gaussian process. We demonstrate applications of the model to network data and clarify its relation to models in the literature, several of which emerge as special cases.

#### Gaussian process behaviour in wide deep neural networks

Alexander G. D. G. Matthews, Jiri Hron, Mark Rowland, Richard E. Turner, Zoubin Ghahramani, 2018. (ICLR).

Abstract▼ URL

Whilst deep neural networks have shown great empirical success, there is still much work to be done to understand their theoretical properties. In this paper, we study the relationship between random, wide, fully connected, feedforward networks with more than one hidden layer and Gaussian processes with a recursive kernel definition. We show that, under broad conditions, as we make the architecture increasingly wide, the implied random function converges in distribution to a Gaussian process, formalising and extending existing results by Neal (1996) to deep networks. To evaluate convergence rates empirically, we use maximum mean discrepancy. We then compare finite Bayesian deep networks from the literature to Gaussian processes in terms of the key predictive quantities of interest, finding that in some cases the agreement can be very close. We discuss the desirability of Gaussian process behaviour and review non-Gaussian alternative models from the literature.

#### Sample-then-optimise posterior sampling for Bayesian linear models

Alexander G. D. G. Matthews, Jiri Hron, Richard E. Turner, Zoubin Ghahramani, 2017. (AABI (NeurIPS workshop)).

Abstract▼ URL

In modern machine learning it is common to train models which have an extremely high intrinsic capacity. The results obtained are often i nitialization dependent, are different for disparate optimizers and in some cases have no explicit regularization. This raises difficult questions about generalization. A natural approach to questions of generalization is a Bayesian one. There is therefore a growing literature attempting to understand how Bayesian posterior inference could emerge from the complexity of modern practice, even without having such a procedure as the stated goal. In this work we consider a simple special case where exact Bayesian posterior sampling emerges from sampling (cf initialization) and then gradient descent. Specifically, for a Bayesian linear model, if we parameterize it in terms of a deterministic function of an isotropic normal prior, then the action of sampling from the prior followed by first order optimization of the squared loss will give a posterior sample. Although the assumptions are stronger than many real problems, it still exhibits the challenging properties of redundant model capacity and a lack of explicit regularizers, along with initialization and optimizer dependence. It is therefore an interesting controlled test case. Given its simplicity, the method itself may turn out to be of independent interest from our original goal.

#### Bayesian deep CNNs with many channels are Gaussian processes

Roman Novak, Lechao Xiao, Yasaman Bahri, Jaehoon Lee, Greg Yang, Jiri Hron, Daniel A. Abolafia, Jeffrey Pennington, Jascha Sohl-Dickstein, 2019. (ICLR).

Abstract▼ URL

There is a previously identified equivalence between wide fully connected neural networks (FCNs) and Gaussian processes (GPs). This equivalence enables, for instance, test set predictions that would have resulted from a fully Bayesian, infinitely wide trained FCN to be computed without ever instantiating the FCN, but by instead evaluating the corresponding GP. In this work, we derive an analogous equivalence for multi-layer convolutional neural networks (CNNs) both with and without pooling layers, and achieve state of the art results on CIFAR10 for GPs without trainable kernels. We also introduce a Monte Carlo method to estimate the GP corresponding to a given neural network architecture, even in cases where the analytic form has too many terms to be computationally feasible. Surprisingly, in the absence of pooling layers, the GPs corresponding to CNNs with and without weight sharing are identical. As a consequence, translation equivariance, beneficial in finite channel CNNs trained with stochastic gradient descent (SGD), is guaranteed to play no role in the Bayesian treatment of the infinite channel limit—a qualitative difference between the two regimes that is not present in the FCN case. We confirm experimentally, that while in some scenarios the performance of SGD-trained finite CNNs approaches that of the corresponding GPs as the channel count increases, with careful tuning SGD-trained CNNs can significantly outperform their corresponding GPs, suggesting advantages from SGD training compared to fully Bayesian parameter estimation.

#### Neural Tangents: fast and easy infinite networks in Python

Roman Novak, Lechao Xiao, Jiri Hron, Jaehoon Lee, Jascha Sohl-Dickstein, Samuel Schoenholz, 2020. (ICLR).

Abstract▼ URL

Neural Tangents is a library designed to enable research into infinite-width neural networks. It provides a high-level API for specifying complex and hierarchical neural network architectures. These networks can then be trained and evaluated either at finite-width as usual or in their infinite-width limit. Infinite-width networks can be trained analytically using exact Bayesian inference or using gradient descent via the Neural Tangent Kernel. Additionally, Neural Tangents provides tools to study gradient descent training dynamics of wide but finite networks in either function space or weight space. The entire library runs out-of-the-box on CPU, GPU, or TPU. All computations can be automatically distributed over multiple accelerators with near-linear scaling in the number of devices.

#### An Infinite Latent Attribute Model for Network Data

Konstantina Palla, David A. Knowles, Zoubin Ghahramani, June 2012. (In 29th International Conference on Machine Learning). Edinburgh, Scotland.

Abstract▼ URL

Latent variable models for network data extract a summary of the relational structure underlying an observed network. The simplest possible models subdivide nodes of the network into clusters; the probability of a link between any two nodes then depends only on their cluster assignment. Currently available models can be classified by whether clusters are disjoint or are allowed to overlap. These models can explain a “flat” clustering structure. Hierarchical Bayesian models provide a natural approach to capture more complex dependencies. We propose a model in which objects are characterised by a latent feature vector. Each feature is itself partitioned into disjoint groups (subclusters), corresponding to a second layer of hierarchy. In experimental comparisons, the model achieves significantly improved predictive performance on social and biological link prediction tasks. The results indicate that models with a single layer hierarchy over-simplify real networks.

#### The Most Persistent Soft-Clique in a Set of Sampled Graphs

Novi Quadrianto, Chao Chen, Christoph Lampert, June 2012. (In 29th International Conference on Machine Learning). Edinburgh, Scotland.

Abstract▼ URL

When searching for characteristic subpatterns in potentially noisy graph data, it appears self-evident that having multiple observations would be better than having just one. However, it turns out that the inconsistencies introduced when different graph instances have different edge sets pose a serious challenge. In this work we address this challenge for the problem of finding maximum weighted cliques. We introduce the concept of most persistent soft-clique. This is subset of vertices, that 1) is almost fully or at least densely connected, 2) occurs in all or almost all graph instances, and 3) has the maximum weight. We present a measure of clique-ness, that essentially counts the number of edge missing to make a subset of vertices into a clique. With this measure, we show that the problem of finding the most persistent soft-clique problem can be cast either as: a) a max-min two person game optimization problem, or b) a min-min soft margin optimization problem. Both formulations lead to the same solution when using a partial Lagrangian method to solve the optimization problems. By experiments on synthetic data and on real social network data we show that the proposed method is able to reliably find soft cliques in graph data, even if that is distorted by random noise or unreliable observations.