Originally from Hamburg, Germany, Julius studied Mathematics (MSci) at Imperial College London and Artificial Intelligence (MSc) at UPC Barcelona and TU Delft. He joined the Cambridge-Tübingen programme in October 2018, where he is supervised by Adrian Weller in Cambridge and Bernhard Schölkopf at the Max Planck Institute for Intelligent Systems in Tübingen. He is funded by a Cambridge-Tübingen PhD fellowship with generous support from Amazon. His research interests include causal inference and its applications to machine learning (e.g., for transfer and reinforcement learning, fairness, and interpretability).

## Publications

#### Probable Domain Generalization via Quantile Risk Minimization

C. Eastwood, A. Robey, S. Singh, J. von Kügelgen, H. Hassani, G. J. Pappas, B. Schölkopf, 2022. (In Advances in Neural Information Processing Systems 35). Curran Associates, Inc.

Domain generalization (DG) seeks predictors which perform well on unseen test distributions by leveraging data drawn from multiple related training distributions or domains. To achieve this, DG is commonly formulated as an average- or worst-case problem over the set of possible domains. However, predictors that perform well on average lack robustness while predictors that perform well in the worst case tend to be overly conservative. To address this, we propose a new probabilistic framework for DG where the goal is to learn predictors that perform well with high probability. Our key idea is that distribution shifts seen during training should inform us of probable shifts at test time, which we realize by explicitly relating training and test domains as draws from the same underlying meta-distribution. To achieve probable DG, we propose a new optimization problem called Quantile Risk Minimization (QRM). By minimizing the α-quantile of a predictor’s risk distribution over domains, QRM seeks predictors that perform well with probability α. To solve QRM in practice, we propose the Empirical QRM (EQRM) algorithm, and prove: (i) a generalization bound for EQRM; and (ii) that EQRM recovers the causal predictor as α → 1. In our experiments, we introduce a more holistic quantile-focused evaluation protocol for DG, and demonstrate that EQRM outperforms state-of-the-art baselines on CMNIST and several datasets from WILDS and DomainBed.

#### Independent mechanism analysis, a new concept?

L. Gresele*, J. von Kügelgen*, V. Stimper, B. Schölkopf, M. Besserve, 2021. (In Advances in Neural Information Processing Systems 34). Edited by M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, J. Wortman Vaughan. Curran Associates, Inc. **Note**: *equal contribution.

Independent component analysis provides a principled framework for unsupervised representation learning, with solid theory on the identifiability of the latent code that generated the data, given only observations of mixtures thereof. Unfortunately, when the mixing is nonlinear, the model is provably nonidentifiable, since statistical independence alone does not sufficiently constrain the problem. Identifiability can be recovered in settings where additional, typically observed variables are included in the generative process. We investigate an alternative path and consider instead including assumptions reflecting the principle of independent causal mechanisms exploited in the field of causality. Specifically, our approach is motivated by thinking of each source as independently influencing the mixing process. This gives rise to a framework which we term independent mechanism analysis. We provide theoretical and empirical evidence that our approach circumvents a number of nonidentifiability issues arising in nonlinear blind source separation.

#### Causal Inference Through the Structural Causal Marginal Problem

L. Gresele*, J. von Kügelgen*, J. M. Kübler*, E. Kirschbaum, B. Schölkopf, D. Janzing, 2022. (In 39th International Conference on Machine Learning). Edited by Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, Sivan Sabato. PMLR. **Note**: *equal contribution.

We introduce an approach to counterfactual inference based on merging information from multiple datasets. We consider a causal reformulation of the statistical marginal problem: given a collection of marginal structural causal models (SCMs) over distinct but overlapping sets of variables, determine the set of joint SCMs that are counterfactually consistent with the marginal ones. We formalise this approach for categorical SCMs using the response function formulation and show that it reduces the space of allowed marginal and joint SCMs. Our work thus highlights a new mode of falsifiability through additional variables, in contrast to the statistical one via additional data.

#### Causal Direction of Data Collection Matters: Implications of Causal and Anticausal Learning for NLP

Z. Jin*, J. von Kügelgen*, J. Ni, T. Vaidhya, A. Kaushal, M. Sachan, B. Schölkopf, 2021. (In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP)). Edited by Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih. Association for Computational Linguistics. **DOI**: 10.18653/v1/2021.emnlp-main.748. **Note**: *equal contribution.

The principle of independent causal mechanisms (ICM) states that generative processes of real world data consist of independent modules which do not influence or inform each other. While this idea has led to fruitful developments in the field of causal inference, it is not widely-known in the NLP community. In this work, we argue that the causal direction of the data collection process bears nontrivial implications that can explain a number of published NLP findings, such as differences in semi-supervised learning (SSL) and domain adaptation (DA) performance across different settings. We categorize common NLP tasks according to their causal direction and empirically assay the validity of the ICM principle for text data using minimum description length. We conduct an extensive meta-analysis of over 100 published SSL and 30 DA studies, and find that the results are consistent with our expectations based on causal insights. This work presents the first attempt to analyze the ICM principle in NLP, and provides constructive suggestions for future modeling choices.

#### Algorithmic recourse under imperfect causal knowledge: a probabilistic approach

A.-H. Karimi*, J. von Kügelgen*, B. Schölkopf, I. Valera, 2020. (In Advances in Neural Information Processing Systems 33). Edited by H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, H. Lin. Curran Associates, Inc. **Note**: *equal contribution.

Recent work has discussed the limitations of counterfactual explanations to recommend actions for algorithmic recourse, and argued for the need of taking causal relationships between features into consideration. Unfortunately, in practice, the true underlying structural causal model is generally unknown. In this work, we first show that it is impossible to guarantee recourse without access to the true structural equations. To address this limitation, we propose two probabilistic approaches to select optimal actions that achieve recourse with high probability given limited causal knowledge (e.g., only the causal graph). The first captures uncertainty over structural equations under additive Gaussian noise, and uses Bayesian model averaging to estimate the counterfactual distribution. The second removes any assumptions on the structural equations by instead computing the average effect of recourse actions on individuals similar to the person who seeks recourse, leading to a novel subpopulation-based interventional notion of recourse. We then derive a gradient-based procedure for selecting optimal recourse actions, and empirically show that the proposed approaches lead to more reliable recommendations under imperfect causal knowledge than non-probabilistic baselines.

#### Optimal experimental design via Bayesian optimization: active causal structure learning for Gaussian process networks

J. von Kügelgen, P. K. Rubenstein, B. Schölkopf, A. Weller, December 2019. (In NeurIPS 2019 Workshop Do the right thing: machine learning and causal inference for improved decision making).

We study the problem of causal discovery through targeted interventions. Starting from few observational measurements, we follow a Bayesian active learning approach to perform those experiments which, in expectation with respect to the current model, are maximally informative about the underlying causal structure. Unlike previous work, we consider the setting of continuous random variables with non-linear functional relationships, modelled with Gaussian process priors. To address the arising problem of choosing from an uncountable set of possible interventions, we propose to use Bayesian optimisation to efficiently maximise a Monte Carlo estimate of the expected information gain.

#### Semi-Generative Modelling: Covariate-Shift Adaptation with Cause and Effect Features

J. von Kügelgen, A. Mey, M. Loog, 2019. (In 22nd International Conference on Artificial Intelligence and Statistics). Edited by Kamalika Chaudhuri, Masashi Sugiyama. PMLR.

Current methods for covariate-shift adaptation use unlabelled data to compute importance weights or domain-invariant features, while the final model is trained on labelled data only. Here, we consider a particular case of covariate shift which allows us also to learn from unlabelled data, that is, combining adaptation with semi-supervised learning. Using ideas from causality, we argue that this requires learning with both causes, X_C, and effects, X_E, of a target variable, Y, and show how this setting leads to what we call a semi-generative model, P(Y,X_E|X_C,θ). Our approach is robust to domain shifts in the distribution of causal features and leverages unlabelled data by learning a direct map from causes to effects. Experiments on synthetic data demonstrate significant improvements in classification over purely-supervised and importance-weighting baselines.

#### Semi-supervised learning, causality, and the conditional cluster assumption

J. von Kügelgen, A. Mey, M. Loog, B. Schölkopf, 2020. (In Proceedings of the 36th International Conference on Uncertainty in Artificial Intelligence (UAI)). Edited by Jonas Peters, David Sontag. PMLR. Proceedings of Machine Learning Research. **Note**: *also at NeurIPS 2019 Workshop Do the right thing: machine learning and causal inference for improved decision making.

While the success of semi-supervised learning (SSL) is still not fully understood, Schölkopf et al. (2012) have established a link to the principle of independent causal mechanisms. They conclude that SSL should be impossible when predicting a target variable from its causes, but possible when predicting it from its effects. Since both these cases are restrictive, we extend their work by considering classification using cause and effect features at the same time, such as predicting a disease from both risk factors and symptoms. While standard SSL exploits information contained in the marginal distribution of all inputs (to improve the estimate of the conditional distribution of the target given inputs), we argue that in our more general setting we should use information in the conditional distribution of effect features given causal features. We explore how this insight generalises the previous understanding, and how it relates to and can be exploited algorithmically for SSL.

#### From statistical to causal learning

B. Schölkopf*, J. von Kügelgen*, 2022. (In Proceedings of the International Congress of Mathematicians (ICM)). EMS Press. **Note**: *equal contribution.

We describe basic ideas underlying research to build and understand artificially intelligent systems: from symbolic approaches via statistical learning to interventional models relying on concepts of causality. Some of the hard open problems of machine learning and AI are intrinsically related to causality, and progress may require advances in our understanding of how to model and infer causality from data.

#### Towards causal generative scene models via competition of experts

J. von Kügelgen*, I. Ustyuzhaninov*, P. Gehler, M. Bethge, B. Schölkopf, 2020. (In ICLR 2020 Workshop “Causal Learning for Decision Making”). **Note**: *equal contribution.

Learning how to model complex scenes in a modular way with recombinable components is a pre-requisite for higher-order reasoning and acting in the physical world. However, current generative models lack the ability to capture the inherently compositional and layered nature of visual scenes. While recent work has made progress towards unsupervised learning of object-based scene representations, most models still maintain a global representation space (i.e., objects are not explicitly separated), and cannot generate scenes with novel object arrangement and depth ordering. Here, we present an alternative approach which uses an inductive bias encouraging modularity by training an ensemble of generative models (experts). During training, experts compete for explaining parts of a scene, and thus specialise on different object classes, with objects being identified as parts that re-occur across multiple scenes. Our model allows for controllable sampling of individual objects and recombination of experts in physically plausible ways. In contrast to other methods, depth layering and occlusion are handled correctly, moving this approach closer to a causal generative scene model. Experiments on simple toy data qualitatively demonstrate the conceptual advantages of the proposed approach.

#### Self-supervised learning with data augmentations provably isolates content from style

J. von Kügelgen*, Y. Sharma*, L. Gresele*, W. Brendel, B. Schölkopf, M. Besserve, F. Locatello, 2021. (In Advances in Neural Information Processing Systems 34). Edited by M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, J. Wortman Vaughan. Curran Associates, Inc. **Note**: *equal contribution.

Self-supervised representation learning has shown remarkable success in a number of domains. A common practice is to perform data augmentation via hand-crafted transformations intended to leave the semantics of the data invariant. We seek to understand the empirical success of this approach from a theoretical perspective. We formulate the augmentation process as a latent variable model by postulating a partition of the latent representation into a content component, which is assumed invariant to augmentation, and a style component, which is allowed to change. Unlike prior work on disentanglement and independent component analysis, we allow for both nontrivial statistical and causal dependencies in the latent space. We study the identifiability of the latent representation based on pairs of views of the observations and prove sufficient conditions that allow us to identify the invariant content partition up to an invertible mapping in both generative and discriminative settings. We find numerical simulations with dependent latent variables are consistent with our theory. Lastly, we introduce Causal3DIdent, a dataset of high-dimensional, visually complex images with rich causal dependencies, which we use to study the effect of data augmentations performed in practice.

#### On the Fairness of Causal Algorithmic Recourse

J. von Kügelgen, A.-H. Karimi, U. Bhatt, I. Valera, A. Weller, B. Schölkopf, 2022. (In Proceedings of the 36th AAAI Conference on Artificial Intelligence (AAAI)).

Algorithmic fairness is typically studied from the perspective of predictions. Instead, here we investigate fairness from the perspective of recourse actions suggested to individuals to remedy an unfavourable classification. We propose two new fairness criteria at the group and individual level, which – unlike prior work on equalising the average group-wise distance from the decision boundary – explicitly account for causal relationships between features, thereby capturing downstream effects of recourse actions performed in the physical world. We explore how our criteria relate to others, such as counterfactual fairness, and show that fairness of recourse is complementary to fairness of prediction. We study theoretically and empirically how to enforce fair causal recourse by altering the classifier and perform a case study on the Adult dataset. Finally, we discuss whether fairness violations in the data generating process revealed by our criteria may be better addressed by societal interventions as opposed to constraints on the classifier.

#### Complex interlinkages, key objectives and nexuses amongst the Sustainable Development Goals and climate change: a network analysis

F. Laumann, J. von Kügelgen, T. H. Kanashiro Uehara, M. Barahona, 2022. (The Lancet Planetary Health). **DOI**: 10.1016/S2542-5196(22)00070-5.

Background: Global sustainability is an enmeshed system of complex socioeconomic, climatological, and ecological interactions. The numerous objectives of the UN’s Sustainable Development Goals (SDGs) and the Paris Agreement have various levels of interdependence, making it difficult to ascertain the influence of changes to particular indicators across the whole system. In this analysis, we aimed to detect and rank the complex interlinkages between objectives of sustainability agendas.

Methods: We developed a method to find interlinkages among the 17 SDGs and climate change, including non-linear and non-monotonic dependences. We used time series of indicators defined by the World Bank, consisting of 400 indicators that measure progress towards the 17 SDGs and an 18th variable (annual average temperatures), representing progress in the response to the climate crisis, from 2000 to 2019. This method detects significant dependencies among the time evolution of the objectives by using partial distance correlations, a non-linear measure of conditional dependence that also discounts spurious correlations originating from lurking variables. We then used a network representation to identify the most important objectives (using network centrality) and to obtain nexuses of objectives (defined as highly interconnected clusters in the network).

Findings: Using temporal data from 181 countries spanning 20 years, we analysed dependencies among SDGs and climate for 35 country groupings based on region, development, and income level. The observed significant interlinkages, central objectives, and nexuses identified varied greatly across country groupings; however, SDG 17 (partnerships for the goals) and climate change ranked as highly important across many country groupings. Temperature rise was strongly linked to urbanisation, air pollution, and slum expansion (SDG 11), especially in country groupings likely to be worst affected by climate breakdown, such as Africa. In several country groupings composed of developing nations, we observed a consistent nexus of strongly interconnected objectives formed by SDG 1 (poverty reduction), SDG 4 (education), and SDG 8 (economic growth), sometimes incorporating SDG 5 (gender equality), and SDG 16 (peace and justice).

Interpretation: The differences across groupings emphasise the need to define goals in accordance with local circumstances and priorities. Our analysis highlights global partnerships (SDG 17) as a pivot in global sustainability efforts, which have been strongly linked to economic growth (SDG 8). However, if economic growth and trade expansion were repositioned as a means instead of an end goal of development, our analysis showed that education (SDG 4) and poverty reduction (SDG 1) become more central, thus suggesting that these could be prioritised in global partnerships. Urban livelihoods (SDG 11) were also flagged as important to avoid replicating unsustainable patterns of the past.

#### You Mostly Walk Alone: Analyzing Feature Attribution in Trajectory Prediction

O. Makansi, J. von Kügelgen, F. Locatello, P. Gehler, D. Janzing, T. Brox, B. Schölkopf, 2022. (In 10th International Conference on Learning Representations).

Predicting the future trajectory of a moving agent can be easy when the past trajectory continues smoothly but is challenging when complex interactions with other agents are involved. Recent deep learning approaches for trajectory prediction show promising performance and partially attribute this to successful reasoning about agent-agent interactions. However, it remains unclear which features such black-box models actually learn to use for making predictions. This paper proposes a procedure that quantifies the contributions of different cues to model performance based on a variant of Shapley values. Applying this procedure to state-of-the-art trajectory prediction methods on standard benchmark datasets shows that they are, in fact, unable to reason about interactions. Instead, the past trajectory of the target is the only feature used for predicting its future. For a task with richer social interaction patterns, on the other hand, the tested models do pick up such interactions to a certain extent, as quantified by our feature attribution method. We discuss the limits of the proposed method and its links to causality.

#### Causal Discovery in Heterogeneous Environments Under the Sparse Mechanism Shift Hypothesis

R. Perry, J. von Kügelgen*, B. Schölkopf*, 2022. (In Advances in Neural Information Processing Systems 35). Curran Associates, Inc. **Note**: *shared last author.

Machine learning approaches commonly rely on the assumption of independent and identically distributed (i.i.d.) data. In reality, however, this assumption is almost always violated due to distribution shifts between environments. Although valuable learning signals can be provided by heterogeneous data from changing distributions, it is also known that learning under arbitrary (adversarial) changes is impossible. Causality provides a useful framework for modeling distribution shifts, since causal models encode both observational and interventional distributions. In this work, we explore the sparse mechanism shift hypothesis, which posits that distribution shifts occur due to a small number of changing causal conditionals. Motivated by this idea, we apply it to learning causal structure from heterogeneous environments, where i.i.d. data only allows for learning an equivalence class of graphs without restrictive assumptions. We propose the Mechanism Shift Score (MSS), a score-based approach amenable to various empirical estimators, which provably identifies the entire causal structure with high probability if the sparse mechanism shift hypothesis holds. Empirically, we verify behavior predicted by the theory and compare multiple estimators and score functions to identify the best approaches in practice. Compared to other methods, we show how MSS bridges a gap by both being nonparametric as well as explicitly leveraging sparse changes.

#### Embrace the Gap: VAEs Perform Independent Mechanism Analysis

P. Reizinger*, L. Gresele*, J. Brady*, J. von Kügelgen, D. Zietlow, B. Schölkopf, G. Martius, W. Brendel, M. Besserve, 2022. (In Advances in Neural Information Processing Systems 35). Curran Associates, Inc. **Note**: *equal first authorship.

Variational autoencoders (VAEs) are a popular framework for modeling complex data distributions; they can be efficiently trained via variational inference by maximizing the evidence lower bound (ELBO), at the expense of a gap to the exact (log-)marginal likelihood. While VAEs are commonly used for representation learning, it is unclear why ELBO maximization would yield useful representations, since unregularized maximum likelihood estimation cannot invert the data-generating process. Yet, VAEs often succeed at this task. We seek to elucidate this apparent paradox by studying nonlinear VAEs in the limit of near-deterministic decoders. We first prove that, in this regime, the optimal encoder approximately inverts the decoder – a commonly used but unproven conjecture – which we refer to as self-consistency. Leveraging self-consistency, we show that the ELBO converges to a regularized log-likelihood. This allows VAEs to perform what has recently been termed independent mechanism analysis (IMA): it adds an inductive bias towards decoders with column-orthogonal Jacobians, which helps recover the true latent factors. The gap between ELBO and log-likelihood is therefore welcome, since it bears unanticipated benefits for nonlinear representation learning. In experiments on synthetic and image data, we show that VAEs uncover the true latent factors when the data generating process satisfies the IMA assumption.

#### Visual Representation Learning Does Not Generalize Strongly Within the Same Domain

L. Schott, J. von Kügelgen, F. Träuble, P. Gehler, C. Russell, M. Bethge, B. Schölkopf, F. Locatello, W. Brendel, 2022. (In 10th International Conference on Learning Representations).

An important component for generalization in machine learning is to uncover underlying latent factors of variation as well as the mechanism through which each factor acts in the world. In this paper, we test whether 17 unsupervised, weakly supervised, and fully supervised representation learning approaches correctly infer the generative factors of variation in simple datasets (dSprites, Shapes3D, MPI3D) from controlled environments, and on our contributed CelebGlow dataset. In contrast to prior robustness work that introduces novel factors of variation during test time, such as blur or other (un)structured noise, we here recompose, interpolate, or extrapolate only existing factors of variation from the training data set (e.g., small and medium-sized objects during training and large objects during testing). Models that learn the correct mechanism should be able to generalize to this benchmark. In total, we train and test 2000+ models and observe that all of them struggle to learn the underlying mechanism regardless of supervision signal and architectural bias. Moreover, the generalization capabilities of all tested models drop significantly as we move from artificial datasets towards more realistic real-world datasets. Despite their inability to identify the correct mechanism, the models are quite modular as their ability to infer other in-distribution factors remains fairly stable, providing only a single factor is out-of-distribution. These results point to an important yet understudied problem of learning mechanistic models of observations that can facilitate generalization.

#### Unsupervised Object Learning via Common Fate

M. Tangemann, S. Schneider, J. von Kügelgen, F. Locatello, P. Gehler, T. Brox, M. Kümmerer, M. Bethge, B. Schölkopf, 2021.

Learning generative object models from unlabelled videos is a long-standing problem and is required for causal scene modeling. We decompose this problem into three easier subtasks, and provide candidate solutions for each of them. Inspired by the Common Fate Principle of Gestalt Psychology, we first extract (noisy) masks of moving objects via unsupervised motion segmentation. Second, generative models are trained on the masks of the background and the moving objects, respectively. Third, background and foreground models are combined in a conditional “dead leaves” scene model to sample novel scene configurations where occlusions and depth layering arise naturally. To evaluate the individual stages, we introduce the Fishbowl dataset positioned between complex real-world scenes and common object-centric benchmarks of simplistic objects. We show that our approach allows learning generative models that generalize beyond the occlusions present in the input videos, and represent scenes in a modular fashion that allows sampling plausible scenes outside the training distribution by permitting, for instance, object numbers or densities not observed in the training set.

#### Active Bayesian Causal Inference

C. Toth, L. Lorch, C. Knoll, A. Krause, F. Pernkopf, R. Peharz*, J. von Kügelgen*, 2022. (In Advances in Neural Information Processing Systems 35). Curran Associates, Inc. **Note**: *shared last author.

Causal discovery and causal reasoning are classically treated as separate and consecutive tasks: one first infers the causal graph, and then uses it to estimate causal effects of interventions. However, such a two-stage approach is uneconomical, especially in terms of actively collected interventional data, since the causal query of interest may not require a fully-specified causal model. From a Bayesian perspective, it is also unnatural, since a causal query (e.g., the causal graph or some causal effect) can be viewed as a latent quantity subject to posterior inference – other unobserved quantities that are not of direct interest (e.g., the full causal model) ought to be marginalized out in this process and contribute to our epistemic uncertainty. In this work, we propose Active Bayesian Causal Inference (ABCI), a fully-Bayesian active learning framework for integrated causal discovery and reasoning, which jointly infers a posterior over causal models and queries of interest. In our approach to ABCI, we focus on the class of causally-sufficient, nonlinear additive noise models, which we model using Gaussian processes. We sequentially design experiments that are maximally informative about our target causal query, collect the corresponding interventional data, and update our beliefs to choose the next experiment. Through simulations, we demonstrate that our approach is more data-efficient than several baselines that only focus on learning the full causal graph. This allows us to accurately learn downstream causal queries from fewer samples while providing well-calibrated uncertainty estimates for the quantities of interest.

#### Backward-Compatible Prediction Updates: A Probabilistic Approach

F. Träuble, J. von Kügelgen, M. Kleindessner, F. Locatello, B. Schölkopf, P. Gehler, 2021. (In Advances in Neural Information Processing Systems 34). Edited by M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, J. Wortman Vaughan. Curran Associates, Inc.

When machine learning systems meet real-world applications, accuracy is only one of several requirements. In this paper, we assay a complementary perspective originating from the increasing availability of pre-trained and regularly improving state-of-the-art models. While new improved models develop at a fast pace, downstream tasks vary more slowly or stay constant. Assume that we have a large unlabelled data set for which we want to maintain accurate predictions. Whenever a new and presumably better ML model becomes available, we encounter two problems: (i) given a limited budget, which data points should be re-evaluated using the new model?; and (ii) if the new predictions differ from the current ones, should we update? Problem (i) is about compute cost, which matters for very large data sets and models. Problem (ii) is about maintaining consistency of the predictions, which can be highly relevant for downstream applications; our demand is to avoid negative flips, i.e., changing correct to incorrect predictions. In this paper, we formalize the Prediction Update Problem and present an efficient probabilistic approach as an answer to the above questions. In extensive experiments on standard classification benchmark data sets, we show that our method outperforms alternative strategies along key metrics for backward-compatible prediction updates.