Inference and planning on factor graphs
This document is available as
- pdf http://www.marc-toussaint.net/06-infer/index.pdf
- html http://www.marc-toussaint.net/06-infer/index.html
Software is available via subversion from
- svn co https://infer.svn.sourceforge.net/svnroot/infer infer
- Related pages
- Inference and belief propagation in a nutshell
- The factor graph data structure and generic operations
- The data structure - a factor graph
- States of the data structure
- Message exchange
- Approximate message exchange
- Approximate message exchange (e.g., expectation propagation / assume density filtering)
- Mean field approximations
- Particle filtering
- Generic operations and testing
- Factor types
- Mixture of Gaussians
- Unscented Transform Gaussian factor
- Linearized Function Gaussian factor
- Switch factors for relational/logic models
- Decision tree factor
- Particle factors
- Appendix: Gaussians
I am interested in a new approach to planning and motor control based on probabilistic inference techniques. This approach aims at solving planning and control problems on structured domains - structured means that multiple state variables exist and that the system dynamics are described by some Dynamic Bayesian Network (which can express factorial, hierarchical, and mixed discrete/continuous domains). Inference techniques such as belief propagation, variational techniques or particle-based approaches can then be used to solve control and planning problems (i.e., to solve MDPs/POMDPs).
The goals of this document are to
- provide a reference for factor graphs, message passing, and
specific factor instantiations that can directly be translated to
the implementation in the library,
- collect resources and pointers to existing inference software and related pages.
The current status of the implementation:
- basic low-level computations with Gaussians etc. are robust
- simple DBNs (HMMs, Kalman filters, switching state space models with MoGs) are tested and work
- generic unit testing for factor implementations: Tables and Gaussians pass fully, MoGs and locally-linear Gaussians pass approximately.
The software will implement inference on factor graphs in a quite generic way, based on a virtual definition of factors. At least the following algorithms should thereby become special cases:
- message passing algorithms such as Belief Propagation (BP),
Loopy BP, Junction Tree, BP with any order of message exchanges
- Expectation Propagation, Gaussian (Kalman) filters/smoothers,
all message exchange schemes as above with Gaussian variables,
mixture-of-Gaussians, and factors that implement
`quasi-non-linear' dependencies between Gaussian variables (using
local-linearization or the Unscented Transform)
- Variational approximation, extraction of submodels (e.g., a
single-variable Markov chain from a DBN), Factorial Hidden Markov
Models
- Particle representations (not yet implemented)
Concerning the style of implementation, the software is meant for scientists who are familiar with the theoretical background. Special aims are
- to provide a basic data structure for describing probabilistic
models that is clear, transparent, and as general as possible
- generic implementations of inference algorithms -
trying to be rather independent of the type of variables and
factors/conditional probabilities in the model (independent of
whether they are tables, Gaussians, MoGs, etc)
- algorithms that manipulate models, i.e., take the factor graph as an input and return a new one, e.g. to implement extraction of sub-models
A great overview of existing inference software is given by
- Kevin Murphy
A selection of software is the following:
- Bayes Net Toolbox (BNT), by Kevin Murphy. A Matlab toolbox.
- Graphical Model Tool Kit (GMTK), by Jeff Bilmes.
- Probabilistic Network Library (PNL), by Intel. A C++ version of the BNT; Kevin Murphy is involved.
- Talya Meltzer's c-inference. This inspired me most: a simple and very clean implementation of various algorithms based on factor graphs (mainly used to represent Markov Random Fields).
- This is a far too brief and maybe obscure intro to BP. Its origin was me trying to explain BP in a single lecture, without first introducing the JTA or forward-backward, etc. -
Inference is a generic kind of information processing: If we have information (= a distribution) over one variable and know about dependencies to other variables, then we'd like to compute information (a marginal distribution) about one of the other variables.
So, in the problem of inference, we assume that we know all the dependencies explicitly, say some expert came up with them as an assumed generative model of data. We can store the terms explicitly as tables or Gaussians in the computer. The goal is to find out what the marginals over random variables (or pairs/groups of random variables) are.
Among many ways to do so, one way is to try to transform the joint
from the form
  p(x_1,..,x_n) = \prod_i p(x_i | x_{pa(i)})   (1)
into the form
  p(x_1,..,x_n) = \prod_C b_C(x_C) / \prod_J b_J(x_J).   (2)
Here, the joint is factorized in groups of variables (cliques $x_C$) and has in the denominator factors which refer to the intersection of variables from two groups (the junctions $x_J$). E.g., if we had four variables with the Markovian dependency $x_1 \to x_2 \to x_3 \to x_4$, then (2) would read
  p(x_{1:4}) = p(x_1,x_2)\, p(x_2,x_3)\, p(x_3,x_4) / (p(x_2)\, p(x_3)),
which certainly is a viable way to write the joint.
If we are able to represent the joint in form (2) we won because each factor is unconditional, i.e., it is a proper marginal over the group of variables. From this form of the joint we can directly read off any marginal that we are interested in - we solved the problem of inference.
The question is how to get the joint from the form (1), for
which we explicitly know all terms (stored as tables or so in the
computer), to the form (2), for which we have no idea yet what
the terms are. The scheme is the following: One first rewrites
(1) with factors $f_C(x_C)$ and junction terms $\mu_J(x_J)$,
simply by initializing all $\mu_J$'s to 1 and by grouping the terms in (1) in factors $f_C$ which depend only on a subset $x_C$ of variables, such that
  p(x_1,..,x_n) = \prod_C f_C(x_C) / \prod_J \mu_J(x_J).   (4)
This rewriting is trivial, except that we had to decide on the grouping of variables (the cliques). These $f_C$'s and $\mu_J$'s are now the tables (or Gaussians or ...) that we store and manipulate in the computer.
Now we iteratively manipulate the $f_C$'s and $\mu_J$'s until they end up what we hoped for, that is, until $f_C$ becomes $b_C$ and $\mu_J$ becomes $b_J$. Two simple constraints on how we should manipulate the $f$'s and the $\mu$'s give us a very good idea of how to do the manipulation:
First, we always want to manipulate $f$'s and $\mu$'s in such a way that the joint distribution (4) remains unchanged. That means: whenever we multiply some term to some $f_C$ we should multiply the same term to some $\mu_J$ such that the total quotient in (4) is unchanged.
Second, the manipulation should be such that if we reached our goal (i.e., $f_C$ and $\mu_J$ are proper marginals) then the manipulations shouldn't change anything at all anymore.
Here is one way to do the manipulation: Choose a junction $J$ between two neighboring cliques $C$ and $D$ and update
  \mu'_J(x_J) = \sum_{x_C \setminus x_J} f_C(x_C),   (6)
  f'_D(x_D) = f_D(x_D)\, \mu'_J(x_J) / \mu_J(x_J).   (7)
Note that both constraints are fulfilled: the joint remains the same because of the quotient between new and old $\mu_J$ in (7); and if the $f$'s and $\mu$'s are already the desired proper marginals, then (6) will give $\mu'_J = \mu_J$.
In very loose words: Equation (6) can be interpreted as pushing $\mu_J$ towards the marginal over $x_J$ - well, at least $\mu'_J$ is being assigned to what $f_C$ believes the marginal over $x_J$ is: the summation in (6) is like computing the `marginal of $f_C$' over $x_J$. So if $f_C$ already has a good belief of what the marginal over $x_J$ is, then $\mu'_J$ is assigned to it and, in equation (7), the neighboring factor $f_D$ is being updated to absorb this information about $x_J$ while at the same time keeping the total joint unchanged. This is a message exchange from $f_C$ to $f_D$ via the junction $J$.
Unfortunately, there is a little flaw about this kind of message
passing: Say $f_D$ doesn't have information yet - it is uniform.
If $f_C$ sends a belief to $f_D$ then, directly afterwards,
$f_D$ would send exactly the same belief back to $f_C$. In
effect, $f_C$ potentiates its own belief without any further
information from outside (like a self-confirming effect). To avoid
this, every factor should keep track of what information it has sent out
previously. All information that flows in should then be considered
relative to what has been sent out previously (consider only the
`news'). This is done by splitting up the factor $\mu_J$ in
two factors $\mu_{C\to D}$ and $\mu_{D\to C}$, such that
$\mu_J = \mu_{C\to D}\,\mu_{D\to C}$, and passing messages as
  \mu'_{C\to D}(x_J) = \sum_{x_C \setminus x_J} f_C(x_C) \,/\, \mu_{D\to C}(x_J),   (8)
  f'_D(x_D) = f_D(x_D)\, \mu'_{C\to D}(x_J) / \mu_{C\to D}(x_J).   (9)
The difference between (8) and (6) is the division by $\mu_{D\to C}$. This means that the message sent from $f_C$ to $f_D$ is $f_C$'s belief relative to the message that $f_D$ has previously sent to $f_C$. As before, (9) ensures that the full joint remains unchanged.
Hopefully, if we pass messages between all factors for long enough, the scheme will converge to the desired representation (2) of the joint. [Completely neglected here: the order of message passings, guarantees of the junction tree algorithm, forward-backward in HMMs, issues with loopy BP, approximate message passing, and literature giving more rigorous introductions.]
Let there be $n$ random variables $x_1,..,x_n$. We assume that the joint
probability can be written as a product of $m$ factors,
  p(x_1,..,x_n) = \prod_{i=1}^m f_i(x_{C_i}),
where $C_i$ is a tuple of indices in $\{1,..,n\}$ that
specifies the variables the $i$th factor depends on.
In the infer package, different types of factors are provided: tables, Gaussians, MoGs, Gaussians with quasi-non-linear dependencies, etc. - see below. But the basic operations on the factor graph discussed in this section are largely independent of the type of factors.
Actually, providing the list of factors and the list of tuples is sufficient to fully determine a model. Based on this, some algorithms could analyze the dependencies, e.g., find a Junction Tree, and thereby decide on elimination orders etc. The result of such analysis would be arcs between factors. We include this in the basic data structure.
Let $G$ be a graph with $m$ nodes. The $i$th node is associated with the factor $f_i$. Additionally the graph contains edges $(ij)$ between nodes.
For each edge $(ij)$ we can compute the intersection of random
variables that $f_i$ and $f_j$ depend on; we use $J_{ij}$ to denote
the index tuple of this intersection,
  J_{ij} = C_i \cap C_j.
The random variables $x_{J_{ij}}$ are the basis of communication between the factors $f_i$ and $f_j$ in belief propagation type algorithms - the factors will exchange their beliefs on the marginal of $x_{J_{ij}}$.
As the final part of the basic data structure, we also associate two factors to each edge - one as a memory for forward messages and one for backward messages. We denote them by $\mu_{i\to j}(x_{J_{ij}})$ and $\mu_{j\to i}(x_{J_{ij}})$.
The data structure consists of a graph with nodes and some edges. For each node we have $f_i$ and for each edge we have $\mu_{i\to j}$ and $\mu_{j\to i}$.
As abbreviation we write
  \mu_{ij} = \mu_{i\to j}\, \mu_{j\to i}.
Let us define that the data structure is ...
- junction complete
- if there always is an edge $(ij)$ whenever $f_i$
and $f_j$ share variables
- a junction tree
- if it is a junction complete tree and all junctions
are disjoint (`running intersection property')
- if the factors can be separated in single-variable
factors and clique factors and only interconnections (no
intra-connections) between the groups are present
- in initial state
- if all $\mu_{i\to j} = 1$ and all $f_i$ equal the original factors $f_i^0$,
such that one can read off the original joint directly
- consistent
- with the original factors $f_i^0$, if the
current factors and incoming messages fulfill
  f_i = f_i^0 \prod_j \mu_{j\to i}.
Note that in this case
  \prod_i f_i \,/\, \prod_{(ij)} \mu_{ij} = \prod_i f_i^0
and the original joint is still correctly encoded in the data structure.
- in equilibrium
- if all factors and edges agree on the junction
marginals, which is
  \sum_{x_{C_i} \setminus x_{J_{ij}}} f_i = \sum_{x_{C_j} \setminus x_{J_{ij}}} f_j = \mu_{ij}  for every edge (ij).
The infer package includes routines to check the state of the data structure, i.e., whether it is still consistent or in equilibrium. These are an important tool for debugging.
Say $f_i$ and $f_j$ share some variables $x_{J_{ij}}$ and let $b_i$ and $b_j$ be their marginals w.r.t. these shared variables,
  b_i = \sum_{x_{C_i} \setminus x_{J_{ij}}} f_i,   b_j = \sum_{x_{C_j} \setminus x_{J_{ij}}} f_j.
A (directed!) message exchange from a factor $f_i$ to another $f_j$ is (dashes mean the updated quantities)
  \mu'_{i\to j} = b_i / \mu_{j\to i},
  f'_j = f_j\, \mu'_{i\to j} / \mu_{i\to j}.
The updates are written in many different forms to see alternatives for the implementation. Intuitively: The message $\mu'_{i\to j}$ reflects the marginal of $f_i$ leaving out all information that has previously been exchanged between $i$ and $j$. $\mu_{i\to j}$ `summarizes' all messages that have yet been sent from $i$ to $j$. Alternative thinking: The quotients $f_i/\mu_{j\to i}$ and $f_j/\mu_{i\to j}$ that appear in these equations are the original (initial state) factors multiplied with all incoming messages except for the ones from $j$ and $i$, respectively,
  f_i / \mu_{j\to i} = f_i^0 \prod_{k\ne j} \mu_{k\to i}.
Note that this message exchange preserves consistency because
  f'_j / \mu'_{i\to j} = f_j / \mu_{i\to j}.
Second, note that in equilibrium $b_i = \mu_{ij} = \mu_{i\to j}\mu_{j\to i}$ and thus $\mu'_{i\to j} = \mu_{i\to j}$, and the message exchange has no effect.
[[ realized yet for MoGs ]]
On factor graphs with mixtures of Gaussians (say a Markov chain / switching state space model), the marginal of a factor usually contains more Gaussian modes than a message can represent (the number of modes would grow exponentially on a Markov chain). Hence, marginalization for a MoG is implemented approximately, by collapsing all Gaussians that correspond to the same mixture component.
Factors or messages are sometimes limited to represent only a class of
distributions (e.g., Gaussians in the continuous case); then one might
not be able to do an exact message exchange because the marginal $b_i$ is not
in this class of distributions. The approximate message exchange then reads
  \mu'_{i\to j} = \langle b_i \rangle / \mu_{j\to i},
where the approximation $\langle b_i \rangle$ refers to some Assumed Density Filtering approach. In this case, the message factors should be updated according to
  f'_j = f_j\, \mu'_{i\to j} / \mu_{i\to j},
because only this update preserves consistency (updating $f_j$ with the exact marginal $b_i$ would not, because of the approximation).
[[ todo ]]
Given a graph with only two factors $f_1$ and $f_2$, a mean field approximation would compute the current beliefs ...
Mean field approximations take the ...
[[ Nando's paper: Fast Particle Smoothing: If I had a million particles
How can you rearrange particles from the forward and backward filter? ]]
Think of a Markov chain with
continuous transitions and observations. The forward filter is
represented as a particle population (fwd particle filter), the backward filter is
represented as a particle population (backward particle filter). I want ...
To implement inference algorithms, only a few basic operations need to
be provided to manipulate factors ($f$'s and $\mu$'s). These are essentially the
multiplication, division, and marginalization of factors,
and for practical reasons also some more trivial methods
In the implementation, a class factor will define these methods as virtuals. Factor implementations derived from this base class implement these methods.
A robust way to check the validity of a factor implementation is to define tests in the spirit of unit testing. We define that a valid factor has to pass the following randomized tests:
Let $x_C$ and $x_D$ be two cliques of variables, each with a random number of variables, each variable with random dimensionality/cardinality, and with the two cliques sharing some random variables $x_J$.
Some factor implementations will pass these tests only approximately. E.g., a mixture of Gaussians can pass test 1 and 2, but not exactly test 3.
$x$ is discrete and $f(x)$ is stored as a table. The operations are conceptually straightforward to implement.
In practice, to store a table factor one should always store a table
$P$ scaled to sum to 1 and additionally a log scaling $\alpha$ such that
  f(x) = e^{\alpha}\, P(x).
In that way, over- and underflow in the table is buffered in the scaling $\alpha$. (In HMMs, e.g., $\alpha$ will end up being the data log-likelihood.)
$x$ is a continuous variable and $f(x)$ is a Gaussian. Product, division and marginal product are best implemented in the canonical representation $f(x) \propto \exp\{-\frac12 x^\top A x + x^\top a\}$ with precision matrix $A$. Only the marginal extraction requires matrix inversions, which are the Schur complements: The marginal precision matrix for $x_1$ for a joint precision matrix $\begin{pmatrix} A & B \\ B^\top & C \end{pmatrix}$ over $(x_1,x_2)$ is $A - B C^{-1} B^\top$. See the appendix for details.
Again, in addition to the $a$ and $A$ we store a log scaling $\alpha$.
Two variables $x$ and $y$, a Gaussian over $x$, and a generic (non-linear) but invertible(!) function $f$ from $x$ to $y$. Using the UT to map messages from $x$ to $y$ and back.
Two variables $x$ and $y$, a Gaussian over $x$, and a generic (non-linear, perhaps redundant) function $f$ from $x$ to $y$ which provides a local linearization (the Jacobian). The fwd mapping from $x$ to $y$ is trivial using the linearization around the mean of $x$. The backward mapping is tricky in the case of redundancy:
(i) linearize $f$ around the mean of $x$ (there must be some meaningful prior on $x$, otherwise it doesn't work!),
(ii) redundantly pull back the Gaussian on $y$, blowing up variance along the null space,
(iii) execute the multiplication with the prior on $x$,
(iv) relinearize the function at the new mean, and repeat from (ii)
Consider $K$ random variables $x_1,..,x_K$ of the same type (in the same
space) and a discrete random variable $s \in \{1,..,K\}$.
Given the $x_{1:K}$ and $s$, we have another random variable $y$ which depends
only on $x_s$:
  P(y \,|\, x_{1:K}, s) = P(y \,|\, x_s).
In general terms, $y$ depends on all $x_{1:K}$ and $s$ and we have $P(y | x_{1:K}, s)$. The definition above though captures much more structure in this dependency.
We introduce switch factors to `copy' the random variable $x_s$ into an additional buffer random variable $y$. That way we can then introduce a simple factor for $y$.
A switch factor is of the form
where the two contained factors store marginal factors over $x_s$ and $y$ respectively.
It should be possible to implement all core methods.
[[work in progress]]
[[not approached yet]]
[[ideas, but not approached yet]]
In normal form
  N(x \,|\, a, A) \propto \exp\{-\tfrac12 (x-a)^\top A^{-1} (x-a)\}
In canonical form (with $N[x \,|\, a, A] = N(x \,|\, A^{-1}a,\, A^{-1})$)
  N[x \,|\, a, A] \propto \exp\{-\tfrac12 x^\top A x + x^\top a\}
Non-normalized (e.g., for propagating likelihoods)
We write $N(x|a,A)$ to abbreviate the normal form and $N[x|a,A]$ to abbreviate the canonical form. We give decompositions of products, quotients, and marginals of such Gaussians. Some equations hold only in a limit.
For : Jensen-Shannon divergence.
Collapsing a MoG into a single Gaussian
- technical notes -
- draft -
This document was generated using the LaTeX2HTML translator Version 2002-2-1 (1.71)
The translation was initiated by mt on 2007-07-18