# Information Retrieval

Information retrieval concerns develping systems that find material from within a large unstructured collection (e.g. the internet) that satisfy the user's need. The best example of such systems are web search engines, such as Google, but there are many other specialized applications of information retrieval (such as collaborative filtering and recommender systems). Information retrieval can be thought of as an inference problem: given the user's query, what are the relevant items in the data collection?

#### Bayesian Sets

Zoubin Ghahramani, Katherine A. Heller, December 2006. (In Advances in Neural Information Processing Systems 18). Edited by Y. Weiss, B. Schölkopf, J. Platt. Cambridge, MA, USA. The MIT Press.

Abstract▼ URL

Inspired by “Google™ Sets”, we consider the problem of retrieving items from a concept or cluster, given a query consisting of a few items from that cluster. We formulate this as a Bayesian inference problem and describe a very simple algorithm for solving it. Our algorithm uses a model-based concept of a cluster and ranks items using a score which evaluates the marginal probability that each item belongs to a cluster containing the query items. For exponential family models with conjugate priors this marginal probability is a simple function of sufficient statistics. We focus on sparse binary data and show that our score can be evaluated exactly using a single sparse matrix multiplication, making it possible to apply our algorithm to very large datasets. We evaluate our algorithm on three datasets: retrieving movies from EachMovie, finding completions of author sets from the NIPS dataset, and finding completions of sets of words appearing in the Grolier encyclopedia. We compare to Google™ Sets and show that Bayesian Sets gives very reasonable set completions.

#### A Simple Bayesian Framework for Content-Based Image Retrieval

Katherine A. Heller, Zoubin Ghahramani, 2006. (In CVPR). IEEE Computer Society. **ISBN**: 0-7695-2597-0.

Abstract▼ URL

We present a Bayesian framework for content-based image retrieval which models the distribution of color and texture features within sets of related images. Given a userspecified text query (e.g. “penguins”) the system first extracts a set of images, from a labelled corpus, corresponding to that query. The distribution over features of these images is used to compute a Bayesian score for each image in a large unlabelled corpus. Unlabelled images are then ranked using this score and the top images are returned. Although the Bayesian score is based on computing marginal likelihoods, which integrate over model parameters, in the case of sparse binary data the score reduces to a single matrix-vector multiplication and is therefore extremely efficient to compute. We show that our method works surprisingly well despite its simplicity and the fact that no relevance feedback is used. We compare different choices of features, and evaluate our results using human subjects.

#### A Scalable Gibbs Sampler for Probabilistic Entity Linking

Neil Houlsby, Massimiliano Ciaramita, 2014. (In 36th European Conference on Information Retrieval). Springer.

Abstract▼ URL

Entity linking involves labeling phrases in text with their referent entities, such as Wikipedia or Freebase entries. This task is challenging due to the large number of possible entities, in the millions, and heavy-tailed mention ambiguity. We formulate the problem in terms of probabilistic inference within a topic model, where each topic is associated with a Wikipedia article. To deal with the large number of topics we propose a novel efficient Gibbs sampling scheme which can also incorporate side information, such as the Wikipedia graph. This conceptually simple probabilistic approach achieves state-of-the-art performance in entity-linking on the Aida-CoNLL dataset.

#### Exploration in two-stage recommender systems

Jiri Hron, Karl Krauth, Michael I. Jordan, Niki Kilbertus, 2020. (REVEAL (ACM RecSys workshop)).

Abstract▼ URL

Two-stage recommender systems are widely adopted in industry due to their scalability and maintainability. These systems produce recommendations in two steps: (i) multiple nominators preselect a small number of items from a large pool using cheap-to-compute item embeddings; (ii) with a richer set of features, a ranker rearranges the nominated items and serves them to the user. A key challenge of this setup is that optimal performance of each stage in isolation does not imply optimal global performance. In response to this issue, Ma et al. (2020) proposed a nominator training objective importance weighted by the ranker’s probability of recommending each item. In this work, we focus on the complementary issue of exploration. Modeled as a contextual bandit problem, we find LinUCB (a near optimal exploration strategy for single-stage systems) may lead to linear regret when deployed in two-stage recommenders. We therefore propose a method of synchronising the exploration strategies between the ranker and the nominators. Our algorithm only relies on quantities already computed by standard LinUCB at each stage and can be implemented in three lines of additional code. We end by demonstrating the effectiveness of our algorithm experimentally.

#### On component interactions in two-stage recommender systems

Jiri Hron, Karl Krauth, Michael I. Jordan, Niki Kilbertus, 2021. (NeurIPS).

Abstract▼ URL

Thanks to their scalability, two-stage recommenders are used by many of today’s largest online platforms, including YouTube, LinkedIn, and Pinterest. These systems produce recommendations in two steps: (i) multiple nominators, tuned for low prediction latency, preselect a small subset of candidates from the whole item pool; (ii) a slower but more accurate ranker further narrows down the nominated items, and serves to the user. Despite their popularity, the literature on two-stage recommenders is relatively scarce, and the algorithms are often treated as mere sums of their parts. Such treatment presupposes that the two-stage performance is explained by the behavior of the individual components in isolation. This is not the case: using synthetic and real-world data, we demonstrate that interactions between the ranker and the nominators substantially affect the overall performance. Motivated by these findings, we derive a generalization lower bound which shows that independent nominator training can lead to performance on par with uniformly random recommendations. We find that careful design of item pools, each assigned to a different nominator, alleviates these issues. As manual search for a good pool allocation is difficult, we propose to learn one instead using a Mixture-of-Experts based approach. This significantly improves both precision and recall at K.

#### Modelling content creator incentives on algorithm-curated platforms

Jiri Hron, Karl Krauth, Michael I. Jordan, Niki Kilbertus, Sarah Dean, 2022. (arXiv).

Abstract▼ URL

Content creators compete for user attention. Their reach crucially depends on algorithmic choices made by developers on online platforms. To maximize exposure, many creators adapt strategically, as evidenced by examples like the sprawling search engine optimization industry. This begets competition for the finite user attention pool. We formalize these dynamics in what we call an exposure game, a model of incentives induced by algorithms including modern factorization and (deep) two-tower architectures. We prove that seemingly innocuous algorithmic choices—e.g., non-negative vs. unconstrained factorization—significantly affect the existence and character of (Nash) equilibria in exposure games. We proffer use of creator behavior models like ours for an (ex-ante) pre-deployment audit. Such an audit can identify misalignment between desirable and incentivized content, and thus complement post-hoc measures like content filtering and moderation. To this end, we propose tools for numerically finding equilibria in exposure games, and illustrate results of an audit on the MovieLens and LastFM datasets. Among else, we find that the strategically produced content exhibits strong dependence between algorithmic exploration and content diversity, and between model expressivity and bias towards gender-based user and creator groups.

#### SIGMa: simple greedy matching for aligning large knowledge bases

Simon Lacoste-Julien, Konstantina Palla, Alex Davies, Gjergji Kasneci, Thore Graepel, Zoubin Ghahramani, 2013. (In KDD). Association for Computing Machinery. **ISBN**: 978-1-4503-2174-7.

Abstract▼ URL

The Internet has enabled the creation of a growing number of large-scale knowledge bases in a variety of domains containing complementary information. Tools for automatically aligning these knowledge bases would make it possible to unify many sources of structured knowledge and answer complex queries. However, the efficient alignment of large-scale knowledge bases still poses a considerable challenge. Here, we present Simple Greedy Matching (SiGMa), a simple algorithm for aligning knowledge bases with millions of entities and facts. SiGMa is an iterative propagation algorithm which leverages both the structural information from the relationship graph as well as flexible similarity measures between entity properties in a greedy local search, thus making it scalable. Despite its greedy nature, our experiments indicate that SiGMa can efficiently match some of the world’s largest knowledge bases with high precision. We provide additional experiments on benchmark datasets which demonstrate that SiGMa can outperform state-of-the-art approaches both in accuracy and efficiency.

#### Analogical Reasoning with Relational Bayesian Sets

Ricardo Silva, Katherine A. Heller, Zoubin Ghahramani, 2007. (In AISTATS). Edited by Marina Meila, Xiaotong Shen. JMLR.org. JMLR Proceedings.

Abstract▼ URL

Analogical reasoning depends fundamentally on the ability to learn and generalize about relations between objects. There are many ways in which objects can be related, making automated analogical reasoning very chal- lenging. Here we develop an approach which, given a set of pairs of related objects S = A1:B1,A2:B2,…,AN:BN, measures how well other pairs A:B fit in with the set S. This addresses the question: is the relation between objects A and B analogous to those relations found in S? We recast this classi- cal problem as a problem of Bayesian analy- sis of relational data. This problem is non- trivial because direct similarity between ob- jects is not a good way of measuring analo- gies. For instance, the analogy between an electron around the nucleus of an atom and a planet around the Sun is hardly justified by isolated, non-relational, comparisons of an electron to a planet, and a nucleus to the Sun. We develop a generative model for predicting the existence of relationships and extend the framework of Ghahramani and Heller (2005) to provide a Bayesian measure for how analogous a relation is to other relations. This sheds new light on an old problem, which we motivate and illustrate through practical applications in exploratory data analysis.

#### Ranking Relations Using Analogies in Biological and Information Networks

R. Silva, K. A. Heller, Z. Ghahramani, E. M. Airoldi, 2010. (Annals of Applied Statistics).

Abstract▼ URL

Analogical reasoning depends fundamentally on the ability to learn and generalize about relations between objects. We develop an approach to relational learning which, given a set of pairs of objects S = A(1):B(1), A(2):B(2), …, A(N):B(N), measures how well other pairs A:B fit in with the set S. Our work addresses the question: is the relation between objects A and B analogous to those relations found in S? Such questions are particularly relevant in information retrieval, where an investigator might want to search for analogous pairs of objects that match the query set of interest. There are many ways in which objects can be related, making the task of measuring analogies very challenging. Our approach combines a similarity measure on function spaces with Bayesian analysis to produce a ranking. It requires data containing features of the objects of interest and a link matrix specifying which relationships exist; no further attributes of such relationships are necessary. We illustrate the potential of our method on text analysis and information networks. An application on discovering functional interactions between pairs of proteins is discussed in detail, where we show that our approach can work in practice even if a small set of protein pairs is provided.