Probabilistic Machine Learning chunks
The topics and concepts taught in the Probabilistic Machine Learning
course are broken down into a number of chunks, which are
detailed on this page. The goal of this organisation is to help
students identify and find material. Chunks are designed
to be concise, fairly self-contained, and clearly labeled with
content, prerequisites and relationships to other chunks.
The entire course falls naturally into three parts: Gaussian processes, probabilistic
ranking and text modeling.
Part I: Supervised non-parametric probabilistic inference using Gaussian processes
In a nutshell, part I is concerned with...
- Modelling data
- goals of building a model
- requirements for good models
- data, parameters and latent variables
- Linear in the parameters regression
- making predictions, concept of a model
- least squares fit, and the Normal equations
- requires linear algebra
- model complexity: underfitting and overfitting
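As a quick illustration of the least squares fit and the Normal equations mentioned in this chunk (the notation $\Phi$ for the design matrix and $\mathbf{w}$ for the weights is an assumption of this note, not fixed by the chunk): for a model $f(x) = \sum_m w_m \phi_m(x)$ that is linear in the parameters, the least squares weights solve

$$
\hat{\mathbf{w}} = (\Phi^\top \Phi)^{-1} \Phi^\top \mathbf{y}, \qquad \Phi_{nm} = \phi_m(x_n).
$$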
- Likelihood and the concept of noise
- Gaussian independent and identically distributed (iid) noise
- Maximum likelihood fitting
- Equivalence to least squares
- Motivation of inference with multiple hypotheses
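A one-line sketch of the equivalence between maximum likelihood and least squares under Gaussian iid noise (writing the noise variance as $\sigma_n^2$; notation assumed here): the log likelihood is

$$
\log p(\mathbf{y} \,|\, \mathbf{w}) = -\frac{1}{2\sigma_n^2} \sum_{n=1}^{N} \big(y_n - f(x_n; \mathbf{w})\big)^2 - \frac{N}{2}\log(2\pi\sigma_n^2),
$$

so maximizing it over $\mathbf{w}$ is the same as minimizing the sum of squared errors.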
- Probability fundamentals
- Medical example
- Joint, conditional and marginal probabilities
- The two rules of probability: sum and product
- Bayes' rule
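For reference, the two rules of probability and Bayes' rule, written for generic variables $a$ and $b$:

$$
p(a) = \sum_b p(a, b), \qquad p(a, b) = p(a \,|\, b)\, p(b), \qquad p(b \,|\, a) = \frac{p(a \,|\, b)\, p(b)}{p(a)}.
$$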
- Bayesian inference and prediction with finite regression models
- Likelihood and prior
- Posterior and predictive distribution, with algebra and pictorially
- the marginal likelihood
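A sketch of the standard results for this chunk, assuming a Gaussian prior $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \Sigma_p)$ and likelihood $\mathbf{y} \,|\, \mathbf{w} \sim \mathcal{N}(\Phi \mathbf{w}, \sigma_n^2 I)$ (this particular notation is an assumption of the sketch):

$$
\mathbf{w} \,|\, \mathbf{y} \sim \mathcal{N}\Big(\tfrac{1}{\sigma_n^2} A^{-1} \Phi^\top \mathbf{y},\; A^{-1}\Big), \qquad A = \tfrac{1}{\sigma_n^2}\Phi^\top\Phi + \Sigma_p^{-1},
$$

and the predictive distribution at a test input $x_*$ is $f(x_*) \,|\, \mathbf{y} \sim \mathcal{N}\big(\boldsymbol{\phi}(x_*)^\top \tfrac{1}{\sigma_n^2} A^{-1}\Phi^\top\mathbf{y},\; \boldsymbol{\phi}(x_*)^\top A^{-1} \boldsymbol{\phi}(x_*)\big)$.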
- Background: Some useful Gaussian and matrix equations
- matrix inversion lemma
- mean and variance of Gaussian
- mean and variance of projection of Gaussian
- marginal and conditional of Gaussian
- products of Gaussians
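Two of the identities from this chunk, stated for reference (notation assumed here): the matrix inversion lemma

$$
(Z + UWV^\top)^{-1} = Z^{-1} - Z^{-1}U\,(W^{-1} + V^\top Z^{-1} U)^{-1} V^\top Z^{-1},
$$

and the marginal and conditional of a joint Gaussian

$$
\begin{bmatrix} \mathbf{x} \\ \mathbf{y} \end{bmatrix} \sim \mathcal{N}\!\left(\begin{bmatrix} \mathbf{a} \\ \mathbf{b} \end{bmatrix}, \begin{bmatrix} A & C \\ C^\top & B \end{bmatrix}\right)
\;\Rightarrow\;
\mathbf{x} \sim \mathcal{N}(\mathbf{a}, A), \quad
\mathbf{x} \,|\, \mathbf{y} \sim \mathcal{N}\big(\mathbf{a} + C B^{-1}(\mathbf{y} - \mathbf{b}),\; A - C B^{-1} C^\top\big).
$$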
- Marginal likelihood
- Bayesian model selection
- MCMC-based explanation of how the marginal likelihood works
- Average the likelihood over the prior: example
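A minimal sketch of the "average the likelihood over the prior" view of the marginal likelihood, using simple Monte Carlo rather than the MCMC discussion in the chunk; the toy model, data and all names here are assumptions of the sketch, not course code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-d model, assumed purely for illustration: prior w ~ N(0, 1),
# likelihood y_n ~ N(w * x_n, sigma_n^2).
x = np.array([-1.0, 0.0, 1.0, 2.0])
y = np.array([-0.9, 0.1, 1.1, 2.2])
sigma_n = 0.1

def likelihood(w):
    # p(y | w): product of independent Gaussian noise terms
    resid = y - w * x
    return np.exp(-0.5 * np.sum(resid**2) / sigma_n**2) / (np.sqrt(2 * np.pi) * sigma_n) ** len(y)

# "Average the likelihood over the prior": draw parameters from the prior
# and average p(y | w) over the draws to estimate the marginal likelihood p(y).
w_samples = rng.normal(0.0, 1.0, size=100_000)
print(np.mean([likelihood(w) for w in w_samples]))
```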
- Distributions over parameters and over functions
- Concept of prior over functions and over parameters
- nuisance parameters
- Could we sidestep parameters, and work directly with functions?
- Gaussian process
- From scalar Gaussians to multivariate Gaussians to Gaussian processes
- Functions are like infinitely long vectors, GPs are distributions over functions
- Marginal and conditional Gaussian
- GP definition
- Conditional generation and joint generation
- Gaussian processes and data
- In pictures: prior and posterior
- In algebra: prior and posterior
- An analytic marginal likelihood, and some intuition
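The algebraic posterior referred to above, in standard GP regression notation (assumed here): with training inputs $X$, targets $\mathbf{y}$, test inputs $X_*$, covariance function $k$ giving $K = k(X, X)$ and so on, and noise variance $\sigma_n^2$,

$$
\mathbf{f}_* \,|\, X, \mathbf{y}, X_* \sim \mathcal{N}\Big(K(X_*,X)\,[K + \sigma_n^2 I]^{-1}\mathbf{y},\;\; K(X_*,X_*) - K(X_*,X)\,[K + \sigma_n^2 I]^{-1} K(X,X_*)\Big).
$$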
- Gaussian process marginal likelihood and hyperparameters
- the GP marginal likelihood, and its interpretation
- hyperparameters can control the properties of functions
- example: finding hyperparameters by maximizing the marginal likelihood
- Occam's Razor
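The analytic GP log marginal likelihood, written in the same (assumed) notation as the posterior above:

$$
\log p(\mathbf{y} \,|\, X) = -\tfrac{1}{2}\,\mathbf{y}^\top (K + \sigma_n^2 I)^{-1}\mathbf{y} \;-\; \tfrac{1}{2}\log\big|K + \sigma_n^2 I\big| \;-\; \tfrac{N}{2}\log 2\pi,
$$

where the first term rewards data fit, the second penalises model complexity (the Occam's Razor effect), and hyperparameters are typically set by maximizing this quantity.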
- Correspondence between linear in the parameters models and GPs
- From linear in the parameters models to GPs
- From GPs to linear in the parameters models
- Computational considerations: which is more efficient?
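The correspondence in one line (notation as above, assumed): a linear in the parameters model $f(x) = \boldsymbol{\phi}(x)^\top \mathbf{w}$ with prior $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \Sigma_p)$ is a GP with zero mean and covariance

$$
k(x, x') = \boldsymbol{\phi}(x)^\top \Sigma_p\, \boldsymbol{\phi}(x'),
$$

so whichever of the $N \times N$ (function space) or $D \times D$ (parameter space) computation is smaller tends to be the more efficient route.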
- Covariance functions
- Stationary covariance functions: the squared exponential, rational quadratic and Matérn covariance functions
- periodic covariance
- neural network covariance function
- Combining simple covariance functions into more interesting ones
- The gpml toolbox
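A small sketch of the covariance functions chunk above in code (Python rather than the gpml toolbox; the parameterisations and names here are assumptions of the sketch): a squared exponential and a periodic covariance, combined by multiplication, and used to draw functions from a GP prior.

```python
import numpy as np

def sq_exp(x1, x2, ell=1.0, sf=1.0):
    """Squared exponential covariance (assumed parameterisation: lengthscale ell, signal std sf)."""
    d = x1[:, None] - x2[None, :]
    return sf**2 * np.exp(-0.5 * (d / ell)**2)

def periodic(x1, x2, ell=1.0, period=1.0, sf=1.0):
    """Periodic covariance."""
    d = np.abs(x1[:, None] - x2[None, :])
    return sf**2 * np.exp(-2.0 * np.sin(np.pi * d / period)**2 / ell**2)

# Combine simple covariances into a more interesting one by multiplication,
# then draw a few functions from the corresponding GP prior.
x = np.linspace(-3, 3, 200)
K = sq_exp(x, x) * periodic(x, x)
L = np.linalg.cholesky(K + 1e-9 * np.eye(len(x)))          # jitter for numerical stability
samples = L @ np.random.default_rng(1).standard_normal((len(x), 3))
```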
Part II: Ranking
- Ranking: motivation and tennis example
- Competition in sports and games (TrueSkill problem, matchmaking)
- Tennis: the ATP ranking system explained
- Shortcomings: what does one need to make actual predictions (who wins)?
- The TrueSkill ranking model
- Gibbs sampling
- Calculating integrals using sampling
- Markov chains and invariant distributions
- Gibbs sampling
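A minimal Gibbs sampling sketch, on a toy target assumed purely for illustration (a standard bivariate Gaussian with correlation rho, where both conditionals are available in closed form), not the TrueSkill model:

```python
import numpy as np

rho = 0.8
rng = np.random.default_rng(0)

x1, x2 = 0.0, 0.0
samples = []
for t in range(10_000):
    # Sample each variable from its conditional given the current value of the other:
    # x1 | x2 ~ N(rho * x2, 1 - rho^2), and symmetrically for x2.
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))
    samples.append((x1, x2))

samples = np.array(samples[1000:])   # discard burn-in
print(np.corrcoef(samples.T))        # off-diagonal entries should be close to rho
```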
- Gibbs sampling in TrueSkill
- Conditional distributions in TrueSkill are tractable
- Representing distributions using factor graphs
- the cost of computing marginal distributions
- algebraic and graphical representations
- local computations on the graph
- message passing: the sum-product rules
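The sum-product rules in symbols (standard factor graph notation, assumed here): a variable $x$ sends to a neighbouring factor $f$ the product of its other incoming messages, and a factor sends to $x$ the sum over its other variables of the factor times their messages,

$$
\mu_{x \to f}(x) = \prod_{g \in \mathrm{ne}(x)\setminus\{f\}} \mu_{g \to x}(x), \qquad
\mu_{f \to x}(x) = \sum_{x_1,\dots,x_M} f(x, x_1, \dots, x_M) \prod_{m=1}^{M} \mu_{x_m \to f}(x_m),
$$

and the marginal of $x$ is proportional to the product of all messages arriving at $x$.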
- Message passing in TrueSkill
- messages are not all tractable
- Approximate messages using moment matching
- How to approximate a step function by a Gaussian?
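The moment matching step in symbols (a standard truncated-Gaussian result; the notation is an assumption of this note): to approximate $p(x) \propto \theta(x)\,\mathcal{N}(x; \mu, \sigma^2)$, where $\theta$ is the unit step, write $z = \mu/\sigma$ and $\lambda(z) = \mathcal{N}(z; 0, 1)/\Phi(z)$ with $\Phi$ the standard normal cdf; then

$$
\mathbb{E}[x] = \mu + \sigma\,\lambda(z), \qquad \mathrm{Var}[x] = \sigma^2\big(1 - \lambda(z)\,(\lambda(z) + z)\big),
$$

and the approximating Gaussian is the one with these two moments.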
Part III: Modeling text
- Modeling text
- Modeling collections of documents
- probabilistic models of text
- Bag of words models
- Zipf's law
- Discrete distributions on binary variables (tossing coins)
- Binary variables and the Bernoulli distribution
- Sequences, the binomial and discrete distributions
- Inference and the Beta distribution: probabilities over probabilities
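The key conjugacy result for this chunk: with a $\mathrm{Beta}(\alpha, \beta)$ prior on the probability of heads $\pi$ and $k$ heads observed in $n$ tosses,

$$
p(\pi \,|\, k, n) = \mathrm{Beta}(\pi;\, \alpha + k,\; \beta + n - k), \qquad p(\text{heads} \,|\, k, n) = \frac{\alpha + k}{\alpha + \beta + n}.
$$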
- Discrete distributions over multiple outcomes
- multinomials, categorical and discrete distributions
- inference and the Dirichlet prior
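The multiple-outcome analogue: with counts $c_1, \dots, c_K$ drawn from a discrete distribution with probabilities $\boldsymbol{\pi}$ and a $\mathrm{Dirichlet}(\alpha_1, \dots, \alpha_K)$ prior,

$$
p(\boldsymbol{\pi} \,|\, \mathbf{c}) = \mathrm{Dirichlet}(\alpha_1 + c_1, \dots, \alpha_K + c_K), \qquad p(\text{next outcome} = k \,|\, \mathbf{c}) = \frac{\alpha_k + c_k}{\sum_j (\alpha_j + c_j)}.
$$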
- Document models
- Categorical model
- Mixture of categoricals model
- Training mixture models with EM
- A Bayesian mixture model
- The Expectation Maximization (EM) algorithm
- Maximum likelihood in models with latent variables
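A compact EM sketch for a mixture of categoricals over bag-of-words counts (illustrative Python, not course code; the function name, smoothing constant and initialisation are assumptions of the sketch):

```python
import numpy as np

def em_mixture_of_categoricals(counts, K, iters=50, seed=0):
    """EM for a mixture of categoricals. counts: (D, W) word counts per document.
    Returns mixing proportions pi (K,) and component word probabilities theta (K, W)."""
    rng = np.random.default_rng(seed)
    D, W = counts.shape
    pi = np.full(K, 1.0 / K)
    theta = rng.dirichlet(np.ones(W), size=K)          # random initial word distributions

    for _ in range(iters):
        # E step: responsibilities r[d, k] proportional to pi[k] * prod_w theta[k, w]**counts[d, w]
        log_r = np.log(pi) + counts @ np.log(theta).T  # (D, K) log joint up to a constant
        log_r -= log_r.max(axis=1, keepdims=True)      # stabilise before exponentiating
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)

        # M step: re-estimate mixing proportions and word probabilities from weighted counts
        pi = r.mean(axis=0)
        theta = r.T @ counts + 1e-10                   # small constant avoids zero probabilities
        theta /= theta.sum(axis=1, keepdims=True)

    return pi, theta
```

The E step is done in the log domain for numerical stability; the M step is simply normalised, responsibility-weighted counts.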
- Gibbs sampling for the Bayesian mixture model
- Gibbs sampling
- Collapsed Gibbs sampling
- Latent Dirichlet Allocation topic models
- A more interesting topic model
- Inference using Gibbs sampling
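For concreteness, the collapsed Gibbs sampling update commonly used for LDA with symmetric priors $\alpha$ and $\beta$ (standard notation, assumed here): the topic assignment $z_{d,i}$ of word $w_{d,i}$ is resampled from

$$
p(z_{d,i} = k \,|\, \mathbf{z}_{\neg d,i}, \mathbf{w}) \;\propto\; \big(n_{d,k}^{\neg d,i} + \alpha\big)\,\frac{n_{k,w_{d,i}}^{\neg d,i} + \beta}{n_{k}^{\neg d,i} + V\beta},
$$

where $n_{d,k}$ counts words in document $d$ assigned to topic $k$, $n_{k,w}$ counts assignments of word type $w$ to topic $k$, $n_k$ is the total count for topic $k$, $V$ is the vocabulary size, and all counts exclude the word currently being resampled.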