Natural Language Processing
Techniques for enabling machines to understand, interpret, and generate human language.
A Generative Model of Vector Space Semantics
Jacob Andreas, Zoubin Ghahramani, 2013. (ACL 2013).
Abstract:
We present a novel compositional, generative model for vector space representations of meaning. This model reformulates earlier tensor-based approaches to vector space semantics as a top-down process, and provides efficient algorithms for transformation from natural language to vectors and from vectors to natural language. We describe procedures for estimating the parameters of the model from positive examples of similar phrases, and from distributional representations, then use these procedures to obtain similarity judgments for a set of adjective-noun pairs. The model’s estimation of the similarity of these pairs correlates well with human annotations, demonstrating a substantial improvement over several existing compositional approaches in both settings.
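For intuition, the sketch below illustrates the tensor-style compositional setting that the model reformulates: an adjective acts as a linear map on a noun vector, phrase pairs are scored by cosine similarity, and the scores are correlated with human ratings via Spearman's rho. It is a toy illustration with synthetic embeddings and placeholder ratings, not the paper's generative model.

```python
# Illustrative sketch of matrix-vector composition for adjective-noun pairs.
# NOT the paper's generative model; toy data and assumed setup throughout.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
d = 50  # embedding dimensionality (toy)

# Toy lexicon: nouns as vectors, adjectives as matrices (linear maps).
nouns = {w: rng.normal(size=d) for w in ["house", "dwelling", "fire", "issue"]}
adjs = {w: rng.normal(size=(d, d)) / np.sqrt(d) for w in ["large", "new", "important"]}

def compose(adj, noun):
    """Adjective-as-matrix applied to noun-as-vector."""
    return adjs[adj] @ nouns[noun]

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Phrase pairs with placeholder "human" similarity ratings (purely illustrative).
pairs = [(("large", "house"), ("large", "dwelling"), 5.2),
         (("large", "fire"), ("important", "issue"), 1.4),
         (("new", "house"), ("large", "dwelling"), 3.9)]

model_scores = [cosine(compose(*p1), compose(*p2)) for p1, p2, _ in pairs]
human_scores = [h for _, _, h in pairs]
rho, _ = spearmanr(model_scores, human_scores)
print(f"Spearman rho vs. (toy) human ratings: {rho:.2f}")
```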
Rethinking Attention with Performers
Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, David Benjamin Belanger, Lucy J Colwell, Adrian Weller, 2021. (In International Conference on Learning Representations).
Abstract:
We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. To approximate softmax attention-kernels, Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which may be of independent interest for scalable kernel methods. FAVOR+ can also be used to efficiently model kernelizable attention mechanisms beyond softmax. This representational power is crucial to accurately compare softmax with other kernels for the first time on large-scale tasks, beyond the reach of regular Transformers, and investigate optimal attention-kernels. Performers are linear architectures fully compatible with regular Transformers and with strong theoretical guarantees: unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence and low estimation variance. We tested Performers on a rich set of tasks stretching from pixel-prediction through text models to protein sequence modeling. We demonstrate competitive results with other examined efficient sparse and dense attention methods, showcasing effectiveness of the novel attention-learning paradigm leveraged by Performers.
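As a rough illustration of the mechanism (a simplified sketch, not the full FAVOR+ construction, which additionally uses orthogonal random features and numerical stabilisation), softmax attention can be approximated with positive random features so that no L x L attention matrix is ever formed:

```python
# Sketch: linear-space approximation of softmax attention with positive random
# features (simplified; FAVOR+ adds orthogonal features and stabilisation).
import numpy as np

def positive_features(x, W):
    # phi(x) = exp(W x - ||x||^2 / 2) / sqrt(m); E[phi(x)^T phi(y)] = exp(x^T y)
    m = W.shape[0]
    return np.exp(x @ W.T - 0.5 * np.sum(x**2, axis=-1, keepdims=True)) / np.sqrt(m)

def performer_attention(Q, K, V, m=4096, seed=0):
    L, d = Q.shape
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(m, d))                 # random projection matrix
    Qp = positive_features(Q / d**0.25, W)      # 1/sqrt(d) scaling split across Q and K
    Kp = positive_features(K / d**0.25, W)
    KV = Kp.T @ V                               # (m x d_v): no L x L matrix is built
    normalizer = Qp @ Kp.sum(axis=0)            # (L,): row sums of the implicit kernel
    return (Qp @ KV) / normalizer[:, None]

def softmax_attention(Q, K, V):                 # quadratic reference implementation
    A = np.exp(Q @ K.T / np.sqrt(Q.shape[1]))
    return (A / A.sum(axis=1, keepdims=True)) @ V

L, d = 128, 16
rng = np.random.default_rng(1)
Q, K, V = (rng.normal(size=(L, d)) for _ in range(3))
approx, exact = performer_attention(Q, K, V), softmax_attention(Q, K, V)
print("max abs error:", np.abs(approx - exact).max())
```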
Can We Automate the Analysis of Online Child Sexual Exploitation Discourse?
Darren Cook, Miri Zilka, Heidi DeSandre, Susan Giles, Adrian Weller, Simon Maskell, 2022. (arXiv).
Abstract:
Social media’s growing popularity raises concerns around children’s online safety. Interactions between minors and adults with predatory intentions are a particularly grave concern. Research into online sexual grooming has often relied on domain experts to manually annotate conversations, limiting both scale and scope. In this work, we test how well automated methods can detect conversational behaviors and replace an expert human annotator. Informed by psychological theories of online grooming, we label 6772 chat messages sent by child-sex offenders with one of eleven predatory behaviors. We train bag-of-words and natural language inference models to classify each behavior, and show that the best-performing models classify behaviors in a manner that is consistent, but not on par, with human annotation.
A Psychology-Driven Computational Analysis of Political Interviews
Darren Cook, Miri Zilka, Simon Maskell, Laurence Alison, 2021. (Proc. Interspeech).
Abstract:
Can an interviewer influence the cooperativeness of an interviewee? The role of an interviewer in actualising a successful interview is an active field of social psychological research. A large-scale analysis of interviews, however, typically involves time-exorbitant manual tasks and considerable human effort. Despite recent advances in computational fields, many automated methods continue to rely on manually labelled training data to establish ground truth. This reliance obscures explainability and hinders the mobility of analysis between applications. In this work, we introduce a cross-disciplinary approach to analysing interviewer efficacy. We suggest computational success measures as a transparent, automated, and reproducible alternative to pre-labelled data. We validate these measures with a small-scale study with human responders. To study the interviewer’s influence on the interviewee, we utilise features informed by social psychological theory to predict interview quality based on the interviewer’s linguistic behaviour. Our psychologically informed model significantly outperforms a bag-of-words model, demonstrating the strength of a cross-disciplinary approach toward the analysis of conversational data at scale.
Language-independent Bayesian sentiment mining of Twitter
A. Davies, Z. Ghahramani, August 2011. (In The Fifth Workshop on Social Network Mining and Analysis (SNA-KDD 2011)).
Abstract:
This paper outlines a new language-independent model for sentiment analysis of short, social-network statuses. We demonstrate this on data from Twitter, modelling happy vs sad sentiment, and show that in some circumstances this outperforms similar Naive Bayes models by more than 10%. We also propose an extension to allow the modelling of different sentiment distributions in different geographic regions, while incorporating information from neighbouring regions. We outline the considerations when creating a system analysing Twitter data and present a scalable system of data acquisition and prediction that can monitor the sentiment of tweets in real time.
Causal Direction of Data Collection Matters: Implications of Causal and Anticausal Learning for NLP
Z. Jin, J. von Kügelgen, J. Ni, T. Vaidhya, A. Kaushal, M. Sachan, B. Schölkopf, 2021. (In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP)). Edited by Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih. Association for Computational Linguistics. DOI: 10.18653/v1/2021.emnlp-main.748. Note: *equal contribution.
Abstract:
The principle of independent causal mechanisms (ICM) states that generative processes of real world data consist of independent modules which do not influence or inform each other. While this idea has led to fruitful developments in the field of causal inference, it is not widely known in the NLP community. In this work, we argue that the causal direction of the data collection process bears nontrivial implications that can explain a number of published NLP findings, such as differences in semi-supervised learning (SSL) and domain adaptation (DA) performance across different settings. We categorize common NLP tasks according to their causal direction and empirically assay the validity of the ICM principle for text data using minimum description length. We conduct an extensive meta-analysis of over 100 published SSL and 30 DA studies, and find that the results are consistent with our expectations based on causal insights. This work presents the first attempt to analyze the ICM principle in NLP, and provides constructive suggestions for future modeling choices.
Sub-Linear Memory: How to Make Performers SLiM
Valerii Likhosherstov, Krzysztof M Choromanski, Jared Quincy Davis, Xingyou Song, Adrian Weller, 2021. (In Advances in Neural Information Processing Systems). Edited by M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, J. Wortman Vaughan. Curran Associates, Inc.
Abstract:
The Transformer architecture has revolutionized deep learning on sequential data, becoming ubiquitous in state-of-the-art solutions for a wide variety of applications. Yet vanilla Transformers are notoriously resource-expensive, requiring O(L²) serial time and memory as functions of input length L. Recent works proposed various linear self-attention mechanisms, scaling only as O(L) for serial computation. We perform a thorough analysis of recent Transformer mechanisms with linear self-attention, Performers, in terms of overall computational complexity. We observe a remarkable computational flexibility: forward and backward propagation can be performed with no approximations using sublinear memory as a function of L (in addition to negligible storage for the input sequence), at a cost of greater time complexity in the parallel setting. In the extreme case, a Performer consumes only O(1) memory during training, and still requires O(L) time. This discovered time-memory tradeoff can be used for training or, due to complete backward-compatibility, for fine-tuning on a low-memory device, e.g. a smartphone or an earlier-generation GPU, thus contributing towards decentralized and democratized deep learning.
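To make the time-memory tradeoff concrete, the sketch below shows a forward pass of causal (prefix-sum) linear attention processed chunk by chunk, carrying only a constant-size state between chunks. The feature map and chunk size are illustrative assumptions, and the paper's full algorithm extends this idea to backpropagation via recomputation.

```python
# Sketch: causal (prefix-sum) linear attention processed chunk by chunk.
# Only the running state (S, z) is kept between chunks, so the attention
# state costs O(m * d_v) memory regardless of sequence length L.
import numpy as np

def chunked_causal_linear_attention(Qp, Kp, V, chunk=64):
    """Qp, Kp: (L, m) feature-mapped queries/keys; V: (L, d_v) values."""
    L, m = Qp.shape
    d_v = V.shape[1]
    S = np.zeros((m, d_v))      # running sum of phi(k_j) v_j^T for j <= i
    z = np.zeros(m)             # running sum of phi(k_j)
    out = np.empty_like(V)
    for start in range(0, L, chunk):
        q, k, v = Qp[start:start+chunk], Kp[start:start+chunk], V[start:start+chunk]
        for t in range(q.shape[0]):          # strictly sequential inside the chunk
            S += np.outer(k[t], v[t])
            z += k[t]
            out[start + t] = (q[t] @ S) / (q[t] @ z)
        # the chunk's activations could now be freed and recomputed for backprop
    return out

# Toy usage with a simple positive feature map phi(x) = elu(x) + 1 (an assumption,
# not the Performer's exact FAVOR+ features).
rng = np.random.default_rng(0)
L, d, d_v = 256, 16, 16
phi = lambda X: np.where(X > 0, X + 1.0, np.exp(X))   # elu(x) + 1 > 0
Qp, Kp = phi(rng.normal(size=(L, d))), phi(rng.normal(size=(L, d)))
V = rng.normal(size=(L, d_v))
print(chunked_causal_linear_attention(Qp, Kp, V, chunk=32).shape)
```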
Chefs' Random Tables: Non-Trigonometric Random Features
Valerii Likhosherstov, Krzysztof Choromanski, Avinava Dubey, Frederick Liu, Tamas Sarlos, Adrian Weller, 2022. arXiv. DOI: 10.48550/ARXIV.2205.15317.
Abstract:
We introduce chefs’ random tables (CRTs), a new class of non-trigonometric random features (RFs) to approximate Gaussian and softmax kernels. CRTs are an alternative to standard random kitchen sink (RKS) methods, which inherently rely on the trigonometric maps. We present variants of CRTs where RFs are positive, a key requirement for applications in recent low-rank Transformers. Further variance reduction is possible by leveraging statistics which are simple to compute. One instantiation of CRTs, the optimal positive random features (OPRFs), is to our knowledge the first RF method for unbiased softmax kernel estimation with positive and bounded RFs, resulting in exponentially small tails and much lower variance than its counterparts. As we show, orthogonal random features applied in OPRFs provide additional variance reduction for any dimensionality d (not only asymptotically for sufficiently large d, as for RKS). We test CRTs on many tasks ranging from non-parametric classification to training Transformers for text, speech and image data, obtaining new state-of-the-art results for low-rank text Transformers, while providing linear space and time complexity.
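For background on the contrast drawn here between trigonometric RKS features and positive random features, the toy comparison below estimates the softmax kernel exp(x^T y) with both families and reports empirical mean and variance. It implements only the standard estimators, not the paper's CRT/OPRF constructions.

```python
# Toy Monte-Carlo comparison: trigonometric (RKS-style) vs. positive random
# features for estimating the softmax kernel exp(x^T y).
# Background illustration only -- not the paper's CRT/OPRF estimators.
import numpy as np

rng = np.random.default_rng(0)
d, m, trials = 8, 64, 2000
x, y = rng.normal(size=d) * 0.3, rng.normal(size=d) * 0.3
true_val = np.exp(x @ y)

def positive_rf(x, y, W):
    # E[phi(x) phi(y)] = exp(x^T y) with phi(u) = exp(W u - ||u||^2 / 2)
    px = np.exp(W @ x - 0.5 * x @ x)
    py = np.exp(W @ y - 0.5 * y @ y)
    return np.mean(px * py)

def trig_rf(x, y, W):
    # exp(x^T y) = exp((||x||^2 + ||y||^2)/2) * exp(-||x - y||^2 / 2),
    # and exp(-||x - y||^2 / 2) = E[cos(w^T (x - y))]  (random Fourier features)
    scale = np.exp(0.5 * (x @ x + y @ y))
    cx, sx = np.cos(W @ x), np.sin(W @ x)
    cy, sy = np.cos(W @ y), np.sin(W @ y)
    return scale * np.mean(cx * cy + sx * sy)

est_pos, est_trig = [], []
for _ in range(trials):
    W = rng.normal(size=(m, d))
    est_pos.append(positive_rf(x, y, W))
    est_trig.append(trig_rf(x, y, W))

print(f"true value            : {true_val:.4f}")
print(f"positive RF  mean/var : {np.mean(est_pos):.4f} / {np.var(est_pos):.2e}")
print(f"trig RF      mean/var : {np.mean(est_trig):.4f} / {np.var(est_trig):.2e}")
```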
CWY Parametrization: a Solution for Parallelized Optimization of Orthogonal and Stiefel Matrices
Valerii Likhosherstov, Jared Davis, Krzysztof Choromanski, Adrian Weller, 13–15 Apr 2021. (In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics). Edited by Arindam Banerjee, Kenji Fukumizu. PMLR. Proceedings of Machine Learning Research.
Abstract:
We introduce an efficient approach for optimization over orthogonal groups on highly parallel computation units such as GPUs or TPUs. As in earlier work, we parametrize an orthogonal matrix as a product of Householder reflections. However, to overcome low parallelization capabilities of computing Householder reflections sequentially, we propose employing an accumulation scheme called the compact WY (or CWY) transform – a compact parallelization-friendly matrix representation for the series of Householder reflections. We further develop a novel Truncated CWY (or T-CWY) approach for Stiefel manifold parametrization which has a competitive complexity and, again, yields benefits when computed on GPUs and TPUs. We prove that our CWY and T-CWY methods lead to convergence to a stationary point of the training objective when coupled with stochastic gradient descent. We apply our methods to train recurrent neural network architectures in the tasks of neural machine translation and video prediction.
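The identity at the heart of the approach can be sketched as follows (using the standard compact WY representation; dimensions and the sequential reference implementation are illustrative): a product of k Householder reflections equals I - U S^{-1} U^T for a k x k triangular S, so the whole product reduces to a Gram matrix, a triangular solve, and two matrix products.

```python
# Sketch: compact WY (CWY) representation of a product of Householder reflections,
#   H_1 H_2 ... H_k = I - U S^{-1} U^T,  S = 0.5 * diag(U^T U) + striu(U^T U),
# checked against the naive sequential product. Illustrative shapes only.
import numpy as np

rng = np.random.default_rng(0)
d, k = 32, 8
U = rng.normal(size=(d, k))          # columns u_i define H_i = I - 2 u_i u_i^T / ||u_i||^2

# Naive sequential product of reflections (poorly parallelizable).
Q_seq = np.eye(d)
for i in range(k):
    u = U[:, i]
    Q_seq = Q_seq @ (np.eye(d) - 2.0 * np.outer(u, u) / (u @ u))

# Compact WY form: one Gram matrix, one triangular solve, two matmuls.
G = U.T @ U
S = 0.5 * np.diag(np.diag(G)) + np.triu(G, k=1)
Q_cwy = np.eye(d) - U @ np.linalg.solve(S, U.T)

print("max abs difference :", np.abs(Q_seq - Q_cwy).max())
print("orthogonality error:", np.abs(Q_cwy.T @ Q_cwy - np.eye(d)).max())
```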
The infinite HMM for unsupervised PoS tagging
J. Van Gael, A. Vlachos, Z. Ghahramani, August 2009. (In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP)). Singapore. Association for Computational Linguistics. ISBN: 978-1-932432-62-6.
Abstract:
We extend previous work on fully unsupervised part-of-speech tagging. Using a non-parametric version of the HMM, called the infinite HMM (iHMM), we address the problem of choosing the number of hidden states in unsupervised Markov models for PoS tagging. We experiment with two non-parametric priors, the Dirichlet and Pitman-Yor processes, on the Wall Street Journal dataset using a parallelized implementation of an iHMM inference algorithm. We evaluate the results with a variety of clustering evaluation metrics and achieve performance equivalent to or better than previously reported. Building on this promising result, we evaluate the output of the unsupervised PoS tagger as a direct replacement for the output of a fully supervised PoS tagger for the task of shallow parsing and compare the two evaluations.
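As an example of the kind of clustering evaluation used for unsupervised taggers (one common metric among the variety the paper considers; the helper below is an illustration, not the paper's code), many-to-one accuracy maps each induced state to its most frequent gold tag and then scores ordinary tagging accuracy:

```python
# Illustrative helper: many-to-one accuracy for evaluating induced PoS states
# against gold tags (a common clustering metric; not the paper's exact setup).
from collections import Counter, defaultdict

def many_to_one_accuracy(induced_states, gold_tags):
    assert len(induced_states) == len(gold_tags)
    # Count gold tags co-occurring with each induced state.
    counts = defaultdict(Counter)
    for state, tag in zip(induced_states, gold_tags):
        counts[state][tag] += 1
    # Map every induced state to its most frequent gold tag, then score.
    mapping = {state: tags.most_common(1)[0][0] for state, tags in counts.items()}
    correct = sum(mapping[s] == t for s, t in zip(induced_states, gold_tags))
    return correct / len(gold_tags)

# Toy example: 3 induced states against gold NN/DT/VB tags.
states = [0, 1, 0, 2, 1, 0, 2]
gold   = ["NN", "DT", "NN", "VB", "DT", "NN", "NN"]
print(many_to_one_accuracy(states, gold))   # 6/7 here
```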
Unsupervised and constrained Dirichlet process mixture models for verb clustering
A. Vlachos, A. Korhonen, Z. Ghahramani, March 2009. (In Workshop on Geometrical Models of Natural Language Semantics, EACL '09). Athens, Greece.
Abstract:
In this work, we apply Dirichlet Process Mixture Models (DPMMs) to a learning task in natural language processing (NLP): lexical-semantic verb clustering. We thoroughly evaluate a method of guiding DPMMs towards a particular clustering solution using pairwise constraints. The quantitative and qualitative evaluation performed highlights the benefits of both standard and constrained DPMMs compared to previously used approaches. In addition, it sheds light on the use of evaluation measures and their practical application.
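The constraint mechanism can be sketched as follows (a simplified illustration of how pairwise constraints restrict candidate clusters during sampling; the function and toy data are assumptions, not the paper's implementation): a must-linked partner fixes the assignment, while clusters containing a cannot-linked partner are excluded.

```python
# Sketch: restricting candidate clusters for one item given pairwise constraints
# (simplified; in the paper such constraints guide a DPMM's sampling step).

def allowed_clusters(item, assignments, clusters, must_link, cannot_link):
    """Return the cluster ids that item may join without violating constraints.

    assignments: dict item -> cluster id (for already-assigned items)
    clusters:    iterable of candidate cluster ids (including a 'new' option)
    must_link:   set of frozensets {a, b} that must share a cluster
    cannot_link: set of frozensets {a, b} that must not share a cluster
    """
    # If a must-linked partner is already assigned, the item must follow it.
    for pair in must_link:
        if item in pair:
            other = next(iter(pair - {item}))
            if other in assignments:
                return {assignments[other]}
    # Otherwise exclude any cluster containing a cannot-linked partner.
    forbidden = {assignments[next(iter(pair - {item}))]
                 for pair in cannot_link
                 if item in pair and next(iter(pair - {item})) in assignments}
    return set(clusters) - forbidden

# Toy usage.
assignments = {"acquire": 0, "buy": 0, "eat": 1}
must_link = {frozenset({"purchase", "buy"})}
cannot_link = {frozenset({"devour", "buy"})}
print(allowed_clusters("purchase", assignments, {0, 1, "new"}, must_link, cannot_link))  # {0}
print(allowed_clusters("devour", assignments, {0, 1, "new"}, must_link, cannot_link))    # {1, 'new'} (order may vary)
```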
A Probabilistic Model for Online Document Clustering with Application to Novelty Detection
Jian Zhang, Zoubin Ghahramani, Yiming Yang, 2004. (In NIPS). Edited by Sebastian Thrun, Lawrence K. Saul, Bernhard Schölkopf. MIT Press. ISBN: 0-262-20152-6.
Abstract:
In this paper we propose a probabilistic model for online document clustering. We use a non-parametric Dirichlet process prior to model the growing number of clusters, and use a general English language model as the base distribution to handle the generation of novel clusters. Furthermore, cluster uncertainty is modeled with a Bayesian Dirichlet-multinomial distribution. We use an empirical Bayes method to estimate hyperparameters based on a historical dataset. Our probabilistic model is applied to the novelty detection task in Topic Detection and Tracking (TDT) and compared with existing approaches in the literature.
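A rough sketch of the online decision rule (simplified to unigram clusters with Dirichlet smoothing toward a background model and a CRP-style prior; the constants below are illustrative, not the empirically estimated hyperparameters of the paper): each incoming document either joins the best-scoring existing cluster or, if the new-cluster option wins, is flagged as novel.

```python
# Sketch: online Dirichlet-process-style document clustering for novelty detection.
# Simplified: unigram clusters with Dirichlet smoothing, CRP prior over clusters.
# Hyperparameters are illustrative, not empirically estimated as in the paper.
import math
from collections import Counter

ALPHA = 1.0          # DP concentration: prior weight of opening a new cluster
BETA = 0.01          # Dirichlet smoothing toward the background (base) distribution

def log_likelihood(doc_counts, cluster_counts, background, vocab_size):
    """Log prob of a document under a cluster's smoothed unigram model."""
    total = sum(cluster_counts.values()) + BETA * vocab_size
    ll = 0.0
    for word, count in doc_counts.items():
        # Smooth toward the background ("general English") word probability.
        p = (cluster_counts.get(word, 0)
             + BETA * vocab_size * background.get(word, 1.0 / vocab_size)) / total
        ll += count * math.log(p)
    return ll

def assign(doc, clusters, sizes, background, vocab_size):
    """Return (cluster_index, is_novel) for one incoming document."""
    doc_counts = Counter(doc)
    scores = [math.log(sizes[k]) + log_likelihood(doc_counts, clusters[k], background, vocab_size)
              for k in range(len(clusters))]
    # Score of opening a new cluster: CRP weight alpha times the base (background) model.
    scores.append(math.log(ALPHA) + log_likelihood(doc_counts, Counter(), background, vocab_size))
    best = max(range(len(scores)), key=lambda i: scores[i])
    if best == len(clusters):                 # new cluster wins -> novel document
        clusters.append(Counter(doc_counts)); sizes.append(1)
        return best, True
    clusters[best].update(doc_counts); sizes[best] += 1
    return best, False

# Toy stream: expect the third, unrelated document to open a new cluster (novel).
background, V = {}, 1000
clusters, sizes = [], []
for doc in [["markets", "fell", "sharply"], ["markets", "rose"], ["volcano", "erupts", "in", "iceland"]]:
    print(doc, "->", assign(doc, clusters, sizes, background, V))
```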