The purpose of this web page is to provide some links for people interested in the application of Bayesian ideas to Machine Learning.
Bayes rule states that P(M|D) = P(D|M) P(M) / P(D). We can read this in the following way: "the probability of the model given the data (P(M|D)) is the probability of the data given the model (P(D|M)) times the prior probability of the model (P(M)) divided by the probability of the data (P(D))".
Bayesian statistics, or more precisely Cox's theorem, tells us that we should use Bayes rule to represent and manipulate our degree of belief in some model or hypothesis. In other words, we should treat degrees of belief in exactly the same way as we treat probabilities. Thus, the prior P(M) above represents numerically how much we believe model M to be the true model of the data before we actually observe the data, and the posterior P(M|D) represents how much we believe model M after observing the data. See Chapters 1 and 2 of E T Jaynes' book.
We can think of machine learning as learning models of data. The Bayesian framework for machine learning states that you start out by enumerating all reasonable models of the data and assigning your prior belief P(M) to each of these models. Then, upon observing the data D, you evaluate how probable the data was under each of these models to compute P(D|M). Multiplying this likelihood by the prior and renormalizing results in the posterior probability over models P(M|D), which encapsulates everything that you have learned from the data regarding the possible models under consideration. Thus, to compare two models M and M', we need to compute their relative probability given the data: [P(M) P(D|M)] / [P(M') P(D|M')]. The normalizer P(D) is the same for both models, so it cancels in this ratio.
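The recipe above (prior times likelihood, then renormalize) can be sketched in a few lines of Python. This is only an illustrative sketch: the two coin models, their biases, the equal priors, and the data (8 heads, 2 tails) are assumptions made up for the example, and the helper names are hypothetical. The computation is done in log space to avoid underflow, and it assumes every prior is strictly positive.

```python
import math


def posterior_over_models(priors, log_likelihoods):
    """Return P(M|D) for each model, given P(M) and log P(D|M)."""
    # log of the unnormalized posterior: log P(M) + log P(D|M)
    log_joint = [math.log(p) + ll for p, ll in zip(priors, log_likelihoods)]
    # log P(D) via the log-sum-exp trick for numerical stability
    m = max(log_joint)
    log_evidence = m + math.log(sum(math.exp(lj - m) for lj in log_joint))
    return [math.exp(lj - log_evidence) for lj in log_joint]


def coin_log_likelihood(theta, heads, tails):
    """log P(D|M) for a coin with fixed bias theta (hypothetical model)."""
    return heads * math.log(theta) + tails * math.log(1.0 - theta)


# Assumed example: two candidate models, a fair coin and a biased coin.
biases = [0.5, 0.8]
priors = [0.5, 0.5]
heads, tails = 8, 2

log_liks = [coin_log_likelihood(t, heads, tails) for t in biases]
posterior = posterior_over_models(priors, log_liks)

# Relative probability of the biased model versus the fair model.
posterior_odds = posterior[1] / posterior[0]
print(posterior, posterior_odds)
```

With equal priors the posterior odds reduce to the likelihood ratio, which is why the evidence P(D) never needs to be computed when only two models are being compared.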
Incidentally, if our beliefs are not coherent, in other words, if they violate the rules of probability which include Bayes rule, then the Dutch Book theorem says that if we are willing to accept bets with odds based on the strength of our beliefs, there always exists a set of bets (called a "Dutch book") which we will accept but which is guaranteed to lose us money no matter what the outcome. The only way to avoid being swindled by a Dutch book is to be Bayesian. This has important implications for Machine Learning. If our goal is to design an ideally rational agent, then this agent must represent and manipulate its beliefs using the rules of probability.
In practice, for real-world problem domains, applying Bayes rule exactly is usually impractical because it involves summing or integrating over too large a space of models. These computationally intractable sums or integrals can be avoided by using approximate Bayesian methods. There is a very large body of current research on ways of doing approximate Bayesian machine learning. Some examples of approximate Bayesian methods include Laplace's approximation, variational approximations, expectation propagation, and Markov chain Monte Carlo methods (many papers on MCMC can be found in this repository).
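To give a flavour of one of the methods named above, here is a minimal random-walk Metropolis sampler, one simple member of the MCMC family. It is a sketch under stated assumptions, not a recommended implementation: the target is the posterior over a coin's bias under a uniform prior after 8 heads and 2 tails (an invented toy problem whose exact answer, a Beta(9, 3) distribution with mean 0.75, is known), and the step size, sample count, and burn-in length are arbitrary choices.

```python
import math
import random


def log_posterior_unnorm(theta, heads, tails):
    """Unnormalized log posterior of the bias theta under a uniform prior."""
    if not 0.0 < theta < 1.0:
        return -math.inf  # zero prior mass outside (0, 1)
    return heads * math.log(theta) + tails * math.log(1.0 - theta)


def metropolis(heads, tails, n_samples=20000, step=0.1, seed=0):
    """Random-walk Metropolis; only ratios of the posterior are needed."""
    rng = random.Random(seed)
    theta = 0.5
    current = log_posterior_unnorm(theta, heads, tails)
    samples = []
    for _ in range(n_samples):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_posterior_unnorm(proposal, heads, tails)
        # Accept with probability min(1, p(proposal) / p(theta)).
        # 1 - random() lies in (0, 1], so the log is always defined.
        if math.log(1.0 - rng.random()) < candidate - current:
            theta, current = proposal, candidate
        samples.append(theta)
    return samples


samples = metropolis(heads=8, tails=2)
kept = samples[1000:]  # discard an (arbitrary) burn-in period
posterior_mean = sum(kept) / len(kept)
print(posterior_mean)  # should be close to the exact mean 9 / 12 = 0.75
```

The point of the example is that the sampler never evaluates the normalizing integral P(D); it only compares unnormalized posterior values at pairs of points.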
Bayesian decision theory deals with the problem of making optimal decisions, that is, decisions or actions that minimize our expected loss. Let's say we have a choice of taking one of k possible actions A1 ... Ak and we are considering m possible hypotheses for what the true model of the data is: M1 ... Mm. Assume that if the true model of the data is Mi and we take action Aj we incur a loss of Lij dollars. Then the optimal action A* given the data is the one that minimizes the expected loss: A* = argmin over Aj of Σi Lij P(Mi|D). In other words, A* is the action Aj which has the smallest value of Σi Lij P(Mi|D).
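The expected-loss rule above is short enough to write out directly. The following is a small sketch; the function name, the loss matrix, and the posterior values are invented for illustration. Here `loss[i][j]` plays the role of Lij and `posterior[i]` the role of P(Mi|D).

```python
def optimal_action(loss, posterior):
    """Return (index of A*, expected loss of every action).

    loss[i][j] is the loss incurred if model i is true and action j is taken;
    posterior[i] is P(Mi|D).
    """
    n_models = len(posterior)
    n_actions = len(loss[0])
    expected = [
        sum(loss[i][j] * posterior[i] for i in range(n_models))
        for j in range(n_actions)
    ]
    best = min(range(n_actions), key=lambda j: expected[j])
    return best, expected


# Assumed example: two models, two actions, losses in dollars.
posterior = [0.7, 0.3]
loss = [
    [0.0, 10.0],   # true model M1: action A1 costs 0, A2 costs 10
    [100.0, 0.0],  # true model M2: action A1 costs 100, A2 costs 0
]

best, expected = optimal_action(loss, posterior)
print(best, expected)
```

In this made-up example the most probable model is M1, whose best action is A1, yet the action with the smallest expected loss is A2, because being wrong about M2 is so costly. Acting on the single most probable model is therefore not the same as acting optimally.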
We can derive the fundamentals of the branch of machine learning known as reinforcement learning from Bayesian sequential decision theory. See, for example, Michael Duff's PhD Thesis.
For a description of the debate between Bayesians and frequentists see Chapter 37 of David MacKay's excellent textbook.
Tom Minka provides a short but excellent description of some nuances in the use of probability, especially as it relates to machine learning and pattern recognition.
Last modified: Thu Nov 11 12:29:51 GMT 2004