# Semi-Supervised Learning

Often, it is easy and cheap to obtain large amounts of unlabelled data (e.g. images, text documents), while it is hard or expensive to obtain labelled data. Semi-supervised learning methods attempt to use the unlabelled data to improve the performance on supervised learning tasks, such as classification.

#### Archipelago: nonparametric Bayesian semi-supervised learning

R. Adams, Zoubin Ghahramani, June 2009. (In 26th International Conference on Machine Learning). Edited by Léon Bottou, Michael Littman. Montréal, QC, Canada. Omnipress.

Abstract▼ URL

Semi-supervised learning (SSL), is classification where additional unlabeled data can be used to improve accuracy. Generative approaches are appealing in this situation, as a model of the data’s probability density can assist in identifying clusters. Nonparametric Bayesian methods, while ideal in theory due to their principled motivations, have been difficult to apply to SSL in practice. We present a nonparametric Bayesian method that uses Gaussian processes for the generative model, avoiding many of the problems associated with Dirichlet process mixture models. Our model is fully generative and we take advantage of recent advances in Markov chain Monte Carlo algorithms to provide a practical inference method. Our method compares favorably to competing approaches on synthetic and real-world multi-class data.

**Comment:** This paper was awarded Honourable Mention for Best Paper at ICML 2009.

#### The Rendezvous algorithm: multiclass semi-supervised learning with Markov random walks

Arik Azran, June 2007. (In 24th International Conference on Machine Learning). Edited by Zoubin Ghahramani. Corvallis, OR, USA. Omnipress.

Abstract▼ URL

We consider the problem of multiclass classification where both labeled and unlabeled data points are given. We introduce and demonstrate a new approach for estimating a distribution over the missing labels where data points are viewed as nodes of a graph, and pairwise similarities are used to derive a transition probability matrix P for a Markov random walk between them. The algorithm associates each point with a particle which moves between points according to P. Labeled points are set to be absorbing states of the Markov random walk, and the probability of each particle to be absorbed by the different labeled points, as the number of steps increases, is then used to derive a distribution over the associated missing label. A computationally efficient algorithm to implement this is derived and demonstrated on both real and artificial data sets, including a numerical comparison with other methods.

#### Semi-supervised kernel regression using whitened function classes

Matthias O. Franz, Younghee Kwon, Carl Edward Rasmussen, Bernhard Schölkopf, 2004. (In Lecture Notes in Computer Science (LNCS)). Edited by C. E. Rasmussen, H. H. Bülthoff, M. A. Giese, B. Schölkopf. (Pattern Recognition, Proceedings of the 26th DAGM Symposium). Berlin, Germany. Springer.

Abstract▼ URL

The use of non-orthonormal basis functions in ridge regression leads to an often undesired non-isotropic prior in function space. In this study, we investigate an alternative regularization technique that results in an implicit whitening of the basis functions by penalizing directions in function space with a large prior variance. The regularization term is computed from unlabelled input data that characterizes the input distribution. Tests on two datasets using polynomial basis functions showed an improved average performance compared to standard ridge regression.

#### Unsupervised Many-to-Many Object Matching for Relational Data

Tomoharu Iwata, James Robert Lloyd, Zoubin Ghahramani, 2015. (IEEE Transactions on Pattern Analysis and Machine Intelligence).

Abstract▼ URL

We propose a method for unsupervised many-to-many object matching from multiple networks, which is the task of finding correspondences between groups of nodes in different networks. For example, the proposed method can discover shared word groups from multi-lingual document-word networks without cross-language alignment information. We assume that multiple networks share groups, and each group has its own interaction pattern with other groups. Using infinite relational models with this assumption, objects in different networks are clustered into common groups depending on their interaction patterns, discovering a matching. The effectiveness of the proposed method is experimentally demonstrated by using synthetic and real relational data sets, which include applications to cross-domain recommendation without shared user/item identifiers and multi-lingual word clustering.

#### Semi-supervised learning, causality, and the conditional cluster assumption

J. von Kügelgen, A. Mey, M. Loog, B. Schölkopf, 2020. (In Proceedings of the 36th International Conference on Uncertainty in Artificial Intelligence (UAI)). Edited by Jonas Peters, David Sontag. PMLR. Proceedings of Machine Learning Research. **Note**: *also at NeurIPS 2019 Workshop Do the right thing: machine learning and causal inference for improved decision making.

Abstract▼ URL

While the success of semi-supervised learning (SSL) is still not fully understood, Schölkopf et al. (2012) have established a link to the principle of independent causal mechanisms. They conclude that SSL should be impossible when predicting a target variable from its causes, but possible when predicting it from its effects. Since both these cases are restrictive, we extend their work by considering classification using cause and effect features at the same time, such as predicting a disease from both risk factors and symptoms. While standard SSL exploits information contained in the marginal distribution of all inputs (to improve the estimate of the conditional distribution of the target given in-puts), we argue that in our more general setting we should use information in the conditional distribution of effect features given causal features. We explore how this insight generalises the previous understanding, and how it relates to and can be exploited algorithmically for SSL.

#### Semi-Supervised Domain Adaptation with Non-Parametric Copulas

David Lopez-Paz, José Miguel Hernández-Lobato, Bernhard Scholköpf, December 2012. (In Advances in Neural Information Processing Systems 26). Lake Tahoe, California, USA.

Abstract▼ URL

A new framework based on the theory of copulas is proposed to address semisupervised domain adaptation problems. The presented method factorizes any multivariate density into a product of marginal distributions and bivariate copula functions. Therefore, changes in each of these factors can be detected and corrected to adapt a density model accross different learning domains. Importantly, we introduce a novel vine copula model, which allows for this factorization in a non-parametric manner. Experimental results on regression problems with real-world data illustrate the efficacy of the proposed approach when compared to state-of-the-art techniques.

#### Classification using log Gaussian Cox processes

Alexander G. D. G. Matthews, Zoubin Ghahramani, 2014. (arXiv preprint arXiv:1405.4141).

Abstract▼ URL

McCullagh and Yang (2006) suggest a family of classification algorithms based on Cox processes. We further investigate the log Gaussian variant which has a number of appealing properties. Conditioned on the covariates, the distribution over labels is given by a type of conditional Markov random field. In the supervised case, computation of the predictive probability of a single test point scales linearly with the number of training points and the multiclass generalization is straightforward. We show new links between the supervised method and classical nonparametric methods. We give a detailed analysis of the pairwise graph representable Markov random field, which we use to extend the model to semi-supervised learning problems, and propose an inference method based on graph min-cuts. We give the first experimental analysis on supervised and semi-supervised datasets and show good empirical performance.

#### Traffic Classification in Information Poor Environments

C. Rotsos, J. Van Gael, A.W. Moore, Z. Ghahramani, July 2010. (In 1st International Workshop on Traffic Analysis and Classification (IWCMC '10)). Caen, France.

Abstract▼ URL

Traffic classification using machine learning continues to be an active research area. The majority of work in this area uses *off-the-shelf* machine learning tools and treats them as *black-box* classifiers. This approach turns all the modelling complexity into a feature selection problem. In this paper, we build a problem-specific solution to the traffic classification problem by designing a custom probabilistic graphical model. Graphical models are a modular framework to design classifiers which incorporate domain-specific knowledge. More specifically, our solution introduces semi-supervised learning which means we learn from both labelled and unlabelled traffic flows. We show that our solution performs competitively compared to previous approaches while using less data and simpler features.

#### Safe semi-supervised learning of sum-product networks

Martin Trapp, Tamas Madl, Robert Peharz, Franz Pernkopf, Robert Trappl, August 2017. (In 33st Conference on Uncertainty in Artificial Intelligence). Sidney, Australia.

Abstract▼ URL

In several domains obtaining class annotations is expensive while at the same time unlabelled data are abundant. While most semi-supervised approaches enforce restrictive assumptions on the data distribution, recent work has managed to learn semi-supervised models in a non-restrictive regime. However, so far such approaches have only been proposed for linear models. In this work, we introduce semi-supervised parameter learning for Sum-Product Networks (SPNs). SPNs are deep probabilistic models admitting inference in linear time in number of network edges. Our approach has several advantages, as it (1) allows generative and discriminative semi-supervised learning, (2) guarantees that adding unlabelled data can increase, but not degrade, the performance (safe), and (3) is computationally efficient and does not enforce restrictive assumptions on the data distribution. We show on a variety of data sets that safe semi-supervised learning with SPNs is competitive compared to state-of-the-art and can lead to a better generative and discriminative objective value than a purely supervised approach.

#### Semi-Supervised Learning Using Gaussian Fields and Harmonic Functions

Xiaojin Zhu, Zoubin Ghahramani, John D. Lafferty, 2003. (In ICML). Edited by Tom Fawcett, Nina Mishra. AAAI Press. **ISBN**: 1-57735-189-4.

Abstract▼ URL

An approach to semi-supervised learning is proposed that is based on a Gaussian random field model. Labeled and unlabeled data are represented as vertices in a weighted graph, with edge weights encoding the similarity between instances. The learning problem is then formulated in terms of a Gaussian random field on this graph, where the mean of the field is characterized in terms of harmonic functions, and is efficiently obtained using matrix methods or belief propagation. The resulting learning algorithms have intimate connections with random walks, electric networks, and spectral graph theory. We discuss methods to incorporate class priors and the predictions of classifiers obtained by supervised learning. We also propose a method of parameter learning by entropy minimization, and show the algorithm’s ability to perform feature selection. Promising experimental results are presented for synthetic data, digit classification, and text classification tasks.

#### Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning

Xiaojin Zhu, Jaz S. Kandola, Zoubin Ghahramani, John D. Lafferty, 2004. (In NIPS). Edited by Sebastian Thrun, Lawrence K. Saul, Bernhard Schölkopf. MIT Press. **ISBN**: 0-262-20152-6.

Abstract▼ URL

We present an algorithm based on convex optimization for constructing kernels for semi-supervised learning. The kernel matrices are derived from the spectral decomposition of graph Laplacians, and combine labeled and unlabeled data in a systematic fashion. Unlike previous work using diffusion kernels and Gaussian random field kernels, a nonparametric kernel approach is presented that incorporates order constraints during optimization. This results in flexible kernels and avoids the need to choose among different parametric forms. Our approach relies on a quadratically constrained quadratic program (QCQP), and is computationally feasible for large datasets. We evaluate the kernels on real datasets using support vector machines, with encouraging results.