Probabilistic Machine Learning chunks
The topics and concepts taught in the Probabilistic Machine Learning
course are broken down into a number of chunks, which are
detailed on this page. The goal of this organisation is to help
students identify and find material. Chunks are designed
to be concise, fairly self-contained, and clearly labeled with
content, prerequisites and relationships to other chunks.
The entire course falls naturally into three parts: Gaussian processes, probabilistic
ranking and text modeling.
Part I: Supervised non-parametric probabilistic inference using Gaussian processes
In a nutshell, part I is concerned with...
- Modelling data
- goals of building a model
- requirements for good models
- data, parameters and latent variables
- Linear in the parameters regression
- making predictions, concept of a model
- least squares fit, and the Normal equations
- requires linear algebra
- model complexity: underfitting and overfitting
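As a quick illustration of the least squares fit and the Normal equations mentioned in this chunk (the notation $\Phi$ for the design matrix and $\mathbf{w}$ for the weights is an assumption of this note, not fixed by the chunk): for a model $f(x) = \sum_m w_m \phi_m(x)$ that is linear in the parameters, the least squares weights solve

$$
\hat{\mathbf{w}} = (\Phi^\top \Phi)^{-1} \Phi^\top \mathbf{y}, \qquad \Phi_{nm} = \phi_m(x_n).
$$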
- Likelihood and the concept of noise
- Gaussian independent and identically distributed (iid) noise
- Maximum likelihood fitting
- Equivalence to least squares
- Motivation of inference with multiple hypotheses
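A one-line sketch of the equivalence between maximum likelihood and least squares under Gaussian iid noise (writing the noise variance as $\sigma_n^2$; notation assumed here): the log likelihood is

$$
\log p(\mathbf{y} \,|\, \mathbf{w}) = -\frac{1}{2\sigma_n^2} \sum_{n=1}^{N} \big(y_n - f(x_n; \mathbf{w})\big)^2 - \frac{N}{2}\log(2\pi\sigma_n^2),
$$

so maximizing it over $\mathbf{w}$ is the same as minimizing the sum of squared errors.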
- Probability fundamentals
- Medical example
- Joint, conditional and marginal probabilities
- The two rules of probability: sum and product
- Bayes' rule
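For reference, the two rules of probability and Bayes' rule, written for generic variables $a$ and $b$:

$$
p(a) = \sum_b p(a, b), \qquad p(a, b) = p(a \,|\, b)\, p(b), \qquad p(b \,|\, a) = \frac{p(a \,|\, b)\, p(b)}{p(a)}.
$$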
- Bayesian inference and prediction with finite regression models
- Likelihood and prior
- Posterior and predictive distribution, with algebra and pictorially
- the marginal likelihood
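A sketch of the standard results for this chunk, assuming a Gaussian prior $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \Sigma_p)$ and likelihood $\mathbf{y} \,|\, \mathbf{w} \sim \mathcal{N}(\Phi \mathbf{w}, \sigma_n^2 I)$ (this particular notation is an assumption of the sketch):

$$
\mathbf{w} \,|\, \mathbf{y} \sim \mathcal{N}\Big(\tfrac{1}{\sigma_n^2} A^{-1} \Phi^\top \mathbf{y},\; A^{-1}\Big), \qquad A = \tfrac{1}{\sigma_n^2}\Phi^\top\Phi + \Sigma_p^{-1},
$$

and the predictive distribution at a test input $x_*$ is $f(x_*) \,|\, \mathbf{y} \sim \mathcal{N}\big(\boldsymbol{\phi}(x_*)^\top \tfrac{1}{\sigma_n^2} A^{-1}\Phi^\top\mathbf{y},\; \boldsymbol{\phi}(x_*)^\top A^{-1} \boldsymbol{\phi}(x_*)\big)$.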
- Background: Some useful Gaussian and matrix equations
- matrix inversion lemma
- mean and variance of Gaussian
- mean and variance of projection of Gaussian
- marginal and conditional of Gaussian
- products of Gaussians
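Two of the identities from this chunk, stated for reference (notation assumed here): the matrix inversion lemma

$$
(Z + UWV^\top)^{-1} = Z^{-1} - Z^{-1}U\,(W^{-1} + V^\top Z^{-1} U)^{-1} V^\top Z^{-1},
$$

and the marginal and conditional of a joint Gaussian

$$
\begin{bmatrix} \mathbf{x} \\ \mathbf{y} \end{bmatrix} \sim \mathcal{N}\!\left(\begin{bmatrix} \mathbf{a} \\ \mathbf{b} \end{bmatrix}, \begin{bmatrix} A & C \\ C^\top & B \end{bmatrix}\right)
\;\Rightarrow\;
\mathbf{x} \sim \mathcal{N}(\mathbf{a}, A), \quad
\mathbf{x} \,|\, \mathbf{y} \sim \mathcal{N}\big(\mathbf{a} + C B^{-1}(\mathbf{y} - \mathbf{b}),\; A - C B^{-1} C^\top\big).
$$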
- Marginal likelihood
- Bayesian model selection
- MCMC-based explanation of how the marginal likelihood works
- Average the likelihood over the prior: example
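A minimal sketch of the "average the likelihood over the prior" view of the marginal likelihood, using simple Monte Carlo rather than the MCMC discussion in the chunk; the toy model, data and all names here are assumptions of the sketch, not course code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-d model, assumed purely for illustration: prior w ~ N(0, 1),
# likelihood y_n ~ N(w * x_n, sigma_n^2).
x = np.array([-1.0, 0.0, 1.0, 2.0])
y = np.array([-0.9, 0.1, 1.1, 2.2])
sigma_n = 0.1

def likelihood(w):
    # p(y | w): product of independent Gaussian noise terms
    resid = y - w * x
    return np.exp(-0.5 * np.sum(resid**2) / sigma_n**2) / (np.sqrt(2 * np.pi) * sigma_n) ** len(y)

# "Average the likelihood over the prior": draw parameters from the prior
# and average p(y | w) over the draws to estimate the marginal likelihood p(y).
w_samples = rng.normal(0.0, 1.0, size=100_000)
print(np.mean([likelihood(w) for w in w_samples]))
```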
- Distributions over parameters and over functions
- Concept of prior over functions and over parameters
- nuisance parameters
- Could we sidestep parameters, and work directly with functions?
- Gaussian process
- From scalar Gaussians to multivariate Gaussians to Gaussian processes
- Functions are like infinitely long vectors, GPs are distributions over functions
- Marginal and conditional Gaussian
- GP definition
- Conditional generation and joint generation
- Gaussian processes and data
- In pictures: prior and posterior
- In algebra: prior and posterior
- An analytic marginal likelihood, and some intuition
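The algebraic posterior referred to above, in standard GP regression notation (assumed here): with training inputs $X$, targets $\mathbf{y}$, test inputs $X_*$, covariance function $k$ giving $K = k(X, X)$ and so on, and noise variance $\sigma_n^2$,

$$
\mathbf{f}_* \,|\, X, \mathbf{y}, X_* \sim \mathcal{N}\Big(K(X_*,X)\,[K + \sigma_n^2 I]^{-1}\mathbf{y},\;\; K(X_*,X_*) - K(X_*,X)\,[K + \sigma_n^2 I]^{-1} K(X,X_*)\Big).
$$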
- Gaussian process marginal likelihood and hyperparameters
- the GP marginal likelihood, and its interpretation
- hyperparameters can control the properties of functions
- example: finding hyperparameters by maximizing the marginal likelihood
- Occam's Razor
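The analytic GP log marginal likelihood, written in the same (assumed) notation as the posterior above:

$$
\log p(\mathbf{y} \,|\, X) = -\tfrac{1}{2}\,\mathbf{y}^\top (K + \sigma_n^2 I)^{-1}\mathbf{y} \;-\; \tfrac{1}{2}\log\big|K + \sigma_n^2 I\big| \;-\; \tfrac{N}{2}\log 2\pi,
$$

where the first term rewards data fit, the second penalises model complexity (the Occam's Razor effect), and hyperparameters are typically set by maximizing this quantity.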
- Correspondence between linear in the parameters models and GPs
- From linear in the parameters models to GPs
- From GPs to linear in the parameters models
- Computational considerations: which is more efficient?
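The correspondence in one line (notation as above, assumed): a linear in the parameters model $f(x) = \boldsymbol{\phi}(x)^\top \mathbf{w}$ with prior $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \Sigma_p)$ is a GP with zero mean and covariance

$$
k(x, x') = \boldsymbol{\phi}(x)^\top \Sigma_p\, \boldsymbol{\phi}(x'),
$$

so whichever of the $N \times N$ (function space) or $D \times D$ (parameter space) computation is smaller tends to be the more efficient route.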
- Covariance functions
- Stationary covariance functions: the squared exponential, rational quadratic and Matérn covariance functions
- periodic covariance
- neural network covariance function
- Combining simple covariance functions into more interesting ones
- The gpml toolbox
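A small sketch of the covariance functions chunk above in code (Python rather than the gpml toolbox; the parameterisations and names here are assumptions of the sketch): a squared exponential and a periodic covariance, combined by multiplication, and used to draw functions from a GP prior.

```python
import numpy as np

def sq_exp(x1, x2, ell=1.0, sf=1.0):
    """Squared exponential covariance (assumed parameterisation: lengthscale ell, signal std sf)."""
    d = x1[:, None] - x2[None, :]
    return sf**2 * np.exp(-0.5 * (d / ell)**2)

def periodic(x1, x2, ell=1.0, period=1.0, sf=1.0):
    """Periodic covariance."""
    d = np.abs(x1[:, None] - x2[None, :])
    return sf**2 * np.exp(-2.0 * np.sin(np.pi * d / period)**2 / ell**2)

# Combine simple covariances into a more interesting one by multiplication,
# then draw a few functions from the corresponding GP prior.
x = np.linspace(-3, 3, 200)
K = sq_exp(x, x) * periodic(x, x)
L = np.linalg.cholesky(K + 1e-9 * np.eye(len(x)))          # jitter for numerical stability
samples = L @ np.random.default_rng(1).standard_normal((len(x), 3))
```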
Part II: Ranking
- Ranking: motivation and tennis example
- Competition in sports and games (TrueSkill problem, matchmaking)
- Tennis: the ATP ranking system explained
- Shortcomings: what does one need to make actual predictions (who wins)?
- The TrueSkill ranking model
- Gibbs sampling
- Calculating integrals using sampling
- Markov chains and invariant distributions
- Gibbs sampling
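A minimal Gibbs sampling sketch, on a toy target assumed purely for illustration (a standard bivariate Gaussian with correlation rho, where both conditionals are available in closed form), not the TrueSkill model:

```python
import numpy as np

rho = 0.8
rng = np.random.default_rng(0)

x1, x2 = 0.0, 0.0
samples = []
for t in range(10_000):
    # Sample each variable from its conditional given the current value of the other:
    # x1 | x2 ~ N(rho * x2, 1 - rho^2), and symmetrically for x2.
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))
    samples.append((x1, x2))

samples = np.array(samples[1000:])   # discard burn-in
print(np.corrcoef(samples.T))        # off-diagonal entries should be close to rho
```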
- Gibbs sampling in TrueSkill
- Conditional distributions in TrueSkill are tractable
- Representing distributions using factor graphs
- the cost of computing marginal distributions
- algebraic and graphical representations
- local computations on the graph
- message passing: the sum-product rules
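The sum-product rules in symbols (standard factor graph notation, assumed here): a variable $x$ sends to a neighbouring factor $f$ the product of its other incoming messages, and a factor sends to $x$ the sum over its other variables of the factor times their messages,

$$
\mu_{x \to f}(x) = \prod_{g \in \mathrm{ne}(x)\setminus\{f\}} \mu_{g \to x}(x), \qquad
\mu_{f \to x}(x) = \sum_{x_1,\dots,x_M} f(x, x_1, \dots, x_M) \prod_{m=1}^{M} \mu_{x_m \to f}(x_m),
$$

and the marginal of $x$ is proportional to the product of all messages arriving at $x$.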
- Message passing in TrueSkill
- messages are not all tractable
- Approximate messages using moment matching
- How to approximate a step function by a Gaussian?
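The moment matching step in symbols (a standard truncated-Gaussian result; the notation is an assumption of this note): to approximate $p(x) \propto \theta(x)\,\mathcal{N}(x; \mu, \sigma^2)$, where $\theta$ is the unit step, write $z = \mu/\sigma$ and $\lambda(z) = \mathcal{N}(z; 0, 1)/\Phi(z)$ with $\Phi$ the standard normal cdf; then

$$
\mathbb{E}[x] = \mu + \sigma\,\lambda(z), \qquad \mathrm{Var}[x] = \sigma^2\big(1 - \lambda(z)\,(\lambda(z) + z)\big),
$$

and the approximating Gaussian is the one with these two moments.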
Part III: Modeling text
- Modeling text
- Modeling collections of documents
- probabilistic models of text
- Bag of words models
- Zipf's law
- Discrete distributions on binary variables (tossing coins)
- Binary variables and the Bernoulli distribution
- Sequences, the binomial and discrete distributions
- Inference and the Beta distribution: probabilities over probabilities
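The key conjugacy result for this chunk: with a $\mathrm{Beta}(\alpha, \beta)$ prior on the probability of heads $\pi$ and $k$ heads observed in $n$ tosses,

$$
p(\pi \,|\, k, n) = \mathrm{Beta}(\pi;\, \alpha + k,\; \beta + n - k), \qquad p(\text{heads} \,|\, k, n) = \frac{\alpha + k}{\alpha + \beta + n}.
$$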
- Discrete distributions over multiple outcomes
- multinomials, categorical and discrete distributions
- inference and the Dirichlet prior
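The multiple-outcome analogue: with counts $c_1, \dots, c_K$ drawn from a discrete distribution with probabilities $\boldsymbol{\pi}$ and a $\mathrm{Dirichlet}(\alpha_1, \dots, \alpha_K)$ prior,

$$
p(\boldsymbol{\pi} \,|\, \mathbf{c}) = \mathrm{Dirichlet}(\alpha_1 + c_1, \dots, \alpha_K + c_K), \qquad p(\text{next outcome} = k \,|\, \mathbf{c}) = \frac{\alpha_k + c_k}{\sum_j (\alpha_j + c_j)}.
$$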
- Document models
- Categorical model
- Mixture of categoricals model
- Training mixture models with EM
- A Bayesian mixture model
- The Expectation Maximization (EM) algorithm
- Maximum likelihood in models with latent variables
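A compact EM sketch for a mixture of categoricals over bag-of-words counts (illustrative Python, not course code; the function name, smoothing constant and initialisation are assumptions of the sketch):

```python
import numpy as np

def em_mixture_of_categoricals(counts, K, iters=50, seed=0):
    """EM for a mixture of categoricals. counts: (D, W) word counts per document.
    Returns mixing proportions pi (K,) and component word probabilities theta (K, W)."""
    rng = np.random.default_rng(seed)
    D, W = counts.shape
    pi = np.full(K, 1.0 / K)
    theta = rng.dirichlet(np.ones(W), size=K)          # random initial word distributions

    for _ in range(iters):
        # E step: responsibilities r[d, k] proportional to pi[k] * prod_w theta[k, w]**counts[d, w]
        log_r = np.log(pi) + counts @ np.log(theta).T  # (D, K) log joint up to a constant
        log_r -= log_r.max(axis=1, keepdims=True)      # stabilise before exponentiating
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)

        # M step: re-estimate mixing proportions and word probabilities from weighted counts
        pi = r.mean(axis=0)
        theta = r.T @ counts + 1e-10                   # small constant avoids zero probabilities
        theta /= theta.sum(axis=1, keepdims=True)

    return pi, theta
```

The E step is done in the log domain for numerical stability; the M step is simply normalised, responsibility-weighted counts.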
- Gibbs sampling for the Bayesian mixture model
- Gibbs sampling
- Collapsed Gibbs sampling
- Latent Dirichlet Allocation topic models
- A more interesting topic model
- Inference using Gibbs sampling
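For concreteness, the collapsed Gibbs sampling update commonly used for LDA with symmetric priors $\alpha$ and $\beta$ (standard notation, assumed here): the topic assignment $z_{d,i}$ of word $w_{d,i}$ is resampled from

$$
p(z_{d,i} = k \,|\, \mathbf{z}_{\neg d,i}, \mathbf{w}) \;\propto\; \big(n_{d,k}^{\neg d,i} + \alpha\big)\,\frac{n_{k,w_{d,i}}^{\neg d,i} + \beta}{n_{k}^{\neg d,i} + V\beta},
$$

where $n_{d,k}$ counts words in document $d$ assigned to topic $k$, $n_{k,w}$ counts assignments of word type $w$ to topic $k$, $n_k$ is the total count for topic $k$, $V$ is the vocabulary size, and all counts exclude the word currently being resampled.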