Inference and planning on factor graphs

- technical notes - draft -

This document is available as

Software is available via subversion from


Mission

I am interested in a new approach to planning and motor control based on probabilistic inference techniques. This approach aims to solve planning and control problems on structured domains - structured meaning that multiple state variables exist and that the system dynamics are described by some Dynamic Bayesian Network (which can express factorial, hierarchical, and mixed discrete/continuous domains). Inference techniques such as belief propagation, variational techniques, or particle-based approaches can then be used to solve control and planning problems (solve MDPs/POMDPs).

The goals of this document are to

  1. provide a reference for factor graphs, message passing, and specific factor instantiations that can directly be translated to the implementation in the library,

  2. collect resources and pointers to existing inference software and related pages.

Current state

- basic low level computations with Gaussians etc are robust

- simple DBNs (HMMs, Kalman filters, switching state space models with MoGs) tested and work

- generic unit testing for factor implementations: Tables, Gaussians pass fully, MoGs and locally-linear-Gaussians pass approximately.

Software profile

The software will implement inference on factor graphs in a quite generic way, based on a virtual definition of factors. At least the following algorithms should thereby become special cases:

  • message passing algorithms such as Belief Propagation (BP), Loopy BP, Junction Tree, BP with any order of message exchanges

  • Expectation Propagation, Gaussian (Kalman) filters/smoothers, all message exchange schemes as above with Gaussian variables, mixture-of-Gaussians, and factors that implement `quasi-non-linear' dependencies between Gaussian variables (using local-linearization or the Unscented Transform)

  • Variational approximation, extraction of submodels (e.g., a single variable Markov chain from a DBN), Factorial Hidden Markov Models

  • Particle representations - not yet

Concerning the style of implementation, the software is meant for scientists who are familiar with the theoretical background. Special aims are

  • to provide a basic data structure for describing probabilistic models that is clear, transparent, and as general as possible

  • algorithms that realize inference algorithms in a generic way - trying to be rather independent of the type of variables and factors/conditional probabilities in the model (independent of whether they are tables, Gaussians, MoGs, etc)

  • algorithms that manipulate models, i.e., take the factor graph as an input and return a new one, e.g. to implement extraction of sub-models

Related pages

A great overview of existing inference software is given by

A selection of software is the following:

Bayes Net Toolbox (BNT)
- Kevin Murphy

http://bnt.sourceforge.net/

A Matlab toolbox.

Graphical Model Tool Kit (GMTK)
- Jeff Bilmes

http://ssli.ee.washington.edu/~bilmes/gmtk/

Probabilistic Network Library (PNL)
- Intel

http://www.intel.com/technology/computing/pnl/index.htm

C++ version of the BNT. Kevin Murphy is involved.

Talya Meltzer's c-inference
 

http://www.cs.huji.ac.il/~talyam/inference.html

This inspired me most: a simple and very clean implementation of various algorithms based on factor graphs (mainly used to represent Markov Random Fields).

Rebel

http://choosh.ece.ogi.edu/rebel/

Inference and belief propagation in a nutshell

- This is a far too brief and maybe obscure intro to BP. Its origin was me trying to explain BP in a single lecture, without first introducing the JTA or forward-backward, etc. -

A graphical model is a way to notate the factorization of a joint distribution. Given $n$ random variables $X_i$, a graphical model determines a factorization
\begin{align}
P(X_{1:n}) = \prod_{i=1}^n P(X_i \mid X_{\n_i})
\end{align}
where $\n_i$ denotes the parents of node $i$.

Inference is a generic kind of information processing: If we have information (= a distribution) over one variable and know about dependencies to other variables, then we'd like to compute information (a marginal distribution) about one of the other variables.

So, in the problem of inference, we assume that we know all the dependencies $P(X_i \mid X_{\n_i})$ explicitly - say, some expert came up with them as an assumed generative model of data. We can store the terms $P(X_i \mid X_{\n_i})$ explicitly as tables or Gaussians in the computer. The goal is to find out what the marginals $P(X_i)$ over single random variables (or pairs/groups of random variables) are.
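As a toy illustration (not taken from the infer library): the marginal of the last variable of a small chain can be computed either by brute-force summation over the joint or by eliminating one variable at a time. A minimal Python sketch with binary variables and plain dicts as tables, with invented numbers:

```python
# Conditional tables P(X1), P(X2|X1), P(X3|X2); all variables binary.
P1 = {0: 0.6, 1: 0.4}
P2_1 = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8}  # P2_1[(x2, x1)]
P3_2 = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.5, (1, 1): 0.5}  # P3_2[(x3, x2)]

def marginal_bruteforce():
    """P(X3) by summing the full joint over X1, X2."""
    m = {0: 0.0, 1: 0.0}
    for x1 in (0, 1):
        for x2 in (0, 1):
            for x3 in (0, 1):
                m[x3] += P1[x1] * P2_1[(x2, x1)] * P3_2[(x3, x2)]
    return m

def marginal_elimination():
    """P(X3) by eliminating X1 first, then X2 (sum-product order)."""
    # intermediate table over X2 after summing out X1
    m2 = {x2: sum(P2_1[(x2, x1)] * P1[x1] for x1 in (0, 1)) for x2 in (0, 1)}
    # table over X3 after summing out X2
    return {x3: sum(P3_2[(x3, x2)] * m2[x2] for x2 in (0, 1)) for x3 in (0, 1)}

mb, me = marginal_bruteforce(), marginal_elimination()
assert all(abs(mb[x] - me[x]) < 1e-12 for x in (0, 1))
assert abs(sum(me.values()) - 1.0) < 1e-12
```

Both routes give the same marginal; elimination merely reorders the sums, which is the essence of message passing.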

Among many ways to do so, one way is to try to transform the joint from the form (1) into the form
\begin{align}
P(X_{1:n}) = \frac{\prod_k P(X_{\pi_k})}{\prod_{(kl)} P(X_{\pi_{kl}})}
\end{align}
Here, the joint is factorized in groups $\pi_k$ of variables (cliques) and has in the denominator factors which refer to the intersection of variables from two groups ( $\pi_{kl} = \pi_k \cap \pi_l$) (junctions). E.g., if we had four variables $A \to B \to C \to D$ where $\to$ indicates Markovian dependency, then (2) would read
\begin{align}
P(A,B,C,D) = \frac{P(A,B)~ P(B,C)~ P(C,D)}{P(B) \quad P(C)}
\end{align}
which certainly is a viable way to write the joint.
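The four-variable example can be checked numerically. The following sketch (illustrative only: binary variables, dicts as tables, invented transition values) verifies that the clique/junction form above reproduces the joint of the chain $A \to B \to C \to D$:

```python
import itertools

# Markov chain A -> B -> C -> D with the same binary transition table T
# at every step; T[(next, prev)] is the conditional probability.
pA = {0: 0.3, 1: 0.7}
T = {(0, 0): 0.8, (1, 0): 0.2, (0, 1): 0.4, (1, 1): 0.6}

def joint(a, b, c, d):
    return pA[a] * T[(b, a)] * T[(c, b)] * T[(d, c)]

def marg(keep):
    """Marginal over the variable positions in `keep`, by brute-force summation."""
    m = {}
    for x in itertools.product((0, 1), repeat=4):
        key = tuple(x[i] for i in keep)
        m[key] = m.get(key, 0.0) + joint(*x)
    return m

pAB, pBC, pCD = marg((0, 1)), marg((1, 2)), marg((2, 3))
pB, pC = marg((1,)), marg((2,))

# check P(A,B,C,D) = P(A,B) P(B,C) P(C,D) / (P(B) P(C)) pointwise
for a, b, c, d in itertools.product((0, 1), repeat=4):
    lhs = joint(a, b, c, d)
    rhs = pAB[(a, b)] * pBC[(b, c)] * pCD[(c, d)] / (pB[(b,)] * pC[(c,)])
    assert abs(lhs - rhs) < 1e-12
```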

If we are able to represent the joint in form (2) we have won, because each factor $P(X_{\pi_k})$ is unconditional, i.e., it is a proper marginal over the group $X_{\pi_k}$ of variables. From this form of the joint we can directly read off any marginal we are interested in - we have solved the problem of inference.

The question is how to get the joint from form (1), for which we explicitly know all terms (stored as tables or so in the computer), to form (2), for which we have no idea yet what the terms are. The scheme is the following: one first rewrites (1) with factors $f_k(X_{\pi_k})$ and $\phi_{kl}(X_{\pi_{kl}})$
\begin{align}
P(X_{1:n}) = \frac{\prod_k f_k(X_{\pi_k})}{\prod_{(kl)} \phi_{kl}(X_{\pi_{kl}})}
\end{align}
simply by initializing all $\phi$'s to 1 and by grouping the terms $P(X_i \mid X_{\n_i})$ in (1) into factors $f_k(X_{\pi_k})$ which depend only on a subset of variables, such that
\begin{align}
\prod_k f_k(X_{\pi_k}) = \prod_{i=1}^n P(X_i \Vert X_{\n_i}) ~.
\end{align}
This rewriting is trivial, except that we had to decide on the grouping $X_{\pi_k}$ of variables (the cliques). These $f$'s and $\phi$'s are now the tables (or Gaussians or ...) that we store and manipulate in the computer.

Now we iteratively manipulate the $f$'s and $\phi$'s until they end up as what we hoped for, that is, until $f_k(X_{\pi_k})$ becomes $P(X_{\pi_k})$ and $\phi_{kl}(X_{\pi_{kl}})$ becomes $P(X_{\pi_{kl}})$. Two simple constraints on how we should manipulate the $f$'s and $\phi$'s give us a very good idea of how to do the manipulation:

$\bullet$ First, we always want to manipulate the $f$'s and $\phi$'s in such a way that the joint distribution (4) remains unchanged. That means: whenever we multiply some term into some $f$, we should multiply the same term into some $\phi$ such that the total quotient in (4) is unchanged.

$\bullet$ Second, the manipulation should be such that if we reached our goal (i.e., $f_k(X_{\pi_k}) = P(X_{\pi_k})$ and $\phi_{kl}(X_{\pi_{kl}}) =
P(X_{\pi_{kl}})$ are proper marginals) then the manipulations shouldn't change anything at all anymore.

Here is one way to do the manipulation: Choose a junction $(kl)$ and
\begin{align}
&\phi_{kl}^{\text{new}}(X_{\pi_{kl}})
\gets \sum_{X_{\pi_k \setminus \pi_{kl}}} f_k^{\text{old}}(X_{\pi_k}) \\
&f_l^{\text{new}}(X_{\pi_l})
\gets \frac{\phi_{kl}^{\text{new}}(X_{\pi_{kl}})}{\phi_{kl}^{\text{old}}(X_{\pi_{kl}})}~
f_l^{\text{old}}(X_{\pi_l})
\end{align}
Note that both constraints are fulfilled: the joint remains the same because of the quotient between new and old in (7); and if $f$'s and $\p$'s are already the desired proper marginals, then (6) will give $\phi_{kl}^{\text{new}}(X_{\pi_{kl}}) =
\phi_{kl}^{\text{old}}(X_{\pi_{kl}})$.

In very loose words: Equation (6) can be interpreted as pushing $\phi_{kl}$ towards the marginal over $X_{kl}$ - or at least, $\phi^{\text{new}}_{kl}$ is assigned to what $f_k$ believes the marginal over $X_{kl}$ is: the summation in (6) is like computing the `marginal of $f_k$' over $X_{kl}$. So if $f_k$ already has a good belief of what the marginal over $X_{kl}$ is, then $\phi_{kl}$ is assigned to it and, in equation (7), the neighboring factor $f_l$ is updated to absorb this information about $X_{kl}$ while at the same time keeping the total joint unchanged. This is a message exchange from $f_k$ to $f_l$ via the junction $\phi_{kl}$.
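The updates (6) and (7) are easy to instantiate for tables. A minimal sketch (not the infer package: two clique factors $f_k(A,B)$, $f_l(B,C)$ built from a binary chain, junction $\phi(B)$ initialized to 1, invented numbers), where one message in each direction already calibrates the chain:

```python
import itertools

pA = {0: 0.5, 1: 0.5}
pB_A = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.3, (1, 1): 0.7}   # pB_A[(b, a)]
pC_B = {(0, 0): 0.6, (1, 0): 0.4, (0, 1): 0.2, (1, 1): 0.8}   # pC_B[(c, b)]

# clique factors f_k(A,B) = P(A)P(B|A), f_l(B,C) = P(C|B); junction phi(B) = 1
f_k = {(a, b): pA[a] * pB_A[(b, a)] for a, b in itertools.product((0, 1), repeat=2)}
f_l = {(b, c): pC_B[(c, b)] for b, c in itertools.product((0, 1), repeat=2)}
phi = {0: 1.0, 1: 1.0}

def msg_k_to_l():
    """Updates (6)-(7): phi <- marginal of f_k over B; rescale f_l by new/old."""
    global phi, f_l
    phi_new = {b: f_k[(0, b)] + f_k[(1, b)] for b in (0, 1)}
    f_l = {(b, c): v * phi_new[b] / phi[b] for (b, c), v in f_l.items()}
    phi = phi_new

def msg_l_to_k():
    global phi, f_k
    phi_new = {b: f_l[(b, 0)] + f_l[(b, 1)] for b in (0, 1)}
    f_k = {(a, b): v * phi_new[b] / phi[b] for (a, b), v in f_k.items()}
    phi = phi_new

msg_k_to_l()
msg_l_to_k()

# after one sweep in each direction, phi(B) equals the true marginal P(B)
# and f_l(B,C) equals the true pairwise marginal P(B,C)
pB = {b: sum(pA[a] * pB_A[(b, a)] for a in (0, 1)) for b in (0, 1)}
assert all(abs(phi[b] - pB[b]) < 1e-12 for b in (0, 1))
assert all(abs(f_l[(b, c)] - pB[b] * pC_B[(c, b)]) < 1e-12 for (b, c) in f_l)
```

Note how (7) multiplies and divides by the junction, so the quotient representation of the joint is preserved at every step.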

Unfortunately, there is a little flaw in this kind of message passing: Say $f_2$ doesn't have information yet - it is uniform. If $f_1$ sends a belief to $f_2$, then directly afterwards $f_2$ would send exactly the same belief back to $f_1$. In effect, $f_1$ potentiates its own belief without any further information from outside (a self-confirming effect). To avoid this, every factor should keep track of what information it has sent out previously. All information that flows in should then be considered relative to what has been sent out previously (consider only the `news'). This is done by splitting the factor $\phi_{12}$ into two factors $\m_{1\to 2}$ and $\m_{2\to 1}$, such that $\phi_{12} = \m_{1\to 2}~ \m_{2\to 1}$, and passing messages as
\begin{align}
& \m_{k\to l}^{\text{new}}(X_{\pi_{kl}})
\gets
\frac{1}{\m_{l\to k}(X_{\pi_{kl}})}
\sum_{X_{\pi_k \setminus \pi_{kl}}} f_k^{\text{old}}(X_{\pi_k}) \\
& f_l^{\text{new}}(X_{\pi_l})
\gets
\frac{\m_{k\to l}^{\text{new}}(X_{\pi_{kl}})}{\m_{k\to l}^{\text{old}}(X_{\pi_{kl}})}~
f_l^{\text{old}}(X_{\pi_l})
\end{align}
The difference between (8) and (6) is the division by $\m_{l\to k}$. This means that the message sent from $k$ to $l$ is $f_k$'s belief relative to the message $\m_{l\to k}$ that $f_l$ has previously sent to $f_k$. As before, (9) ensures that the full joint remains unchanged.

Hopefully, if we pass messages between all factors for long enough, the scheme will converge to the desired representation (2) of the joint. [Completely neglected here: the order of message passing, guarantees of the junction tree algorithm, forward-backward in HMMs, issues with loopy BP, approximate message passing, and literature giving more rigorous introductions.]

The factor graph data structure and generic operations

The data structure - a factor graph

Let there be $n$ random variables $X_i$. We assume that the joint probability can be written as a product of $m$ factors $f_k(\vec X_{\pi_k})$, where $\pi_k$ is a tuple of indices in $\{1,..,n\}$ that specifies the variables the $k$th factor depends on
\begin{align*}
P(X_1,..,X_n) = \prod_{k=1}^m f_k(\vec X_{\pi_k}) ~.
\end{align*}

In the infer package, different types of factors are provided: tables, Gaussians, MoGs, Gaussians with quasi-non-linear dependencies, etc - see below. But the basic operations on the factor graph discussed in this section are largely independent of the type of factors.

Actually, providing the list of factors $f_k$ and the list of tuples $\pi_k$ is sufficient to fully determine a model. Based on this, some algorithms could analyze the dependencies, e.g., find a Junction Tree, and thereby decide on elimination orders etc. The result of such analysis would be arcs between factors. We include this in the basic data structure.

Let $G$ be a graph with $m$ nodes. The $k$th node is associated with the factor $f_k$. Additionally the graph contains edges $(kl)$ between nodes.

For each edge $(kl)$ we can compute the intersection of random variables that $f_k$ and $f_l$ depend on; we use $\pi_{kl}$ to denote the index tuple of this intersection,
\begin{align*}
\pi_{kl} = \pi_k \cap \pi_l ~.
\end{align*}
The random variables $\vec X_{\pi_{kl}}$ are the basis of communication between the factors $f_k$ and $f_l$ in belief propagation type algorithms - the factors will exchange their beliefs on the marginal of $\vec X_{\pi_{kl}}$.

As the final part of the basic data structure, we also associate two factors with each edge - one as a memory for forward messages and one for backward messages. We denote them by $\m_{k\to l}(\vec X_{\pi_{kl}})$ and $\m_{l\to k}(\vec X_{\pi_{kl}})$.

The data structure consists of a graph with $m$ nodes and some edges. For each node we have $(\pi_k, f_k)$ and for each edge we have $(\pi_{kl}, \m_{k\to l}, \m_{l\to k})$.

As abbreviation we write
\begin{align*}
\vec X_k := \vec X_{\pi_k}
\comma \vec X_{kl} := \vec X_{\pi_{kl}} ~.
\end{align*}

States of the data structure

Let's define that the data structure is ...

junction complete
if there always is an edge $(kl)$ whenever $f_k$ and $f_l$ share variables

a junction tree
if it is junction complete, a tree, and all junctions $\pi_{kl}$ are disjoint (`running intersection property')

bipartite
if the factors can be separated in single variable factors and clique factors and only interconnections (no intra-connections) between the groups are present

in initial state
if all $\m_{k\to l}=\m_{l\to k}=1$ such that one can (redundantly) write
\begin{align*}
P(X_1,..,X_n)
= \frac{\prod_k f_k(\vec X_k)}
{\prod_{(kl)} \m_{k\to l}(\vec X_{kl})~ \m_{l\to k}(\vec X_{lk})} ~.
\end{align*}

consistent
with the original factors $f_k^\text{org}$ if the current factors $f_k$ and incoming messages $\m_{l\to k}$ fulfill
\begin{align*}
f_k(\vec X_k) = f_k^\text{org}(\vec X_k) \prod_{l:\exists(kl)} \m_{l\to k}(\vec X_{kl})
\end{align*}

Note that in this case
\begin{align*}
P(X_1,..,X_n)
= \prod_k f_k^\text{org}(\vec X_k)
= \frac{\prod_k f_k(\vec X_k)}
{\prod_{(kl)} \m_{k\to l}(\vec X_{kl})~ \m_{l\to k}(\vec X_{kl})}
\end{align*}
and the original joint is still correctly encoded in the data structure.

in equilibrium
if all factors and edges agree on the junction marginals, which is
\begin{align*}
\sum_{\vec X_{k \setminus l}} f_k(\vec X_k)
= \sum_{\vec X_{l \setminus k}} f_l(\vec X_l)
= \m_{k\to l}(\vec X_{kl})~ \m_{l\to k}(\vec X_{kl})
\end{align*}

The infer package includes routines to check the state of the data structure, i.e., whether it is still consistent or in equilibrium. These are an important tool for debugging.
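A sketch of what such checks can look like, for a toy two-factor table graph (the representation and names here are illustrative, not the infer package API):

```python
# Toy graph: factors f_k(A,B), f_l(B,C), one edge with messages mu_kl(B), mu_lk(B).
f_org_k = {(0, 0): 0.1, (1, 0): 0.4, (0, 1): 0.3, (1, 1): 0.2}
f_org_l = {(0, 0): 0.25, (1, 0): 0.25, (0, 1): 0.25, (1, 1): 0.25}

f_k, f_l = dict(f_org_k), dict(f_org_l)   # current factors
mu_kl = {0: 1.0, 1: 1.0}                  # message k -> l, a table over B
mu_lk = {0: 1.0, 1: 1.0}                  # message l -> k

def consistent(eps=1e-12):
    """Check f_k == f_org_k * mu_lk(B) and f_l == f_org_l * mu_kl(B)."""
    ok_k = all(abs(f_k[(a, b)] - f_org_k[(a, b)] * mu_lk[b]) < eps for a, b in f_k)
    ok_l = all(abs(f_l[(b, c)] - f_org_l[(b, c)] * mu_kl[b]) < eps for b, c in f_l)
    return ok_k and ok_l

def in_equilibrium(eps=1e-12):
    """Check that both factors and the edge agree on the junction marginal over B."""
    bar_k = {b: f_k[(0, b)] + f_k[(1, b)] for b in (0, 1)}
    bar_l = {b: f_l[(b, 0)] + f_l[(b, 1)] for b in (0, 1)}
    prod = {b: mu_kl[b] * mu_lk[b] for b in (0, 1)}
    return all(abs(bar_k[b] - bar_l[b]) < eps and abs(bar_k[b] - prod[b]) < eps
               for b in (0, 1))

assert consistent()          # the initial state is consistent by construction
assert not in_equilibrium()  # but not yet calibrated (messages are still 1)
```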

Message exchange

Say $f_1$ and $f_2$ share some variables and let $\bar f_1$ and $\bar f_2$ be their marginals w.r.t. these shared variables. A (directed!) message exchange from a factor $f_1$ to another $f_2$ is (primes denote reassignments):
\begin{align}
& \n = \frac{\bar f_1}{\m_{2\to 1}~ \m_{1\to 2}}\\
& f_2' = f_2~ \n \comma \bar f_2' = \bar f_2~ \n \\
& \m_{1\to 2}'
= \m_{1\to 2}~ \n
= \frac{\bar f_1}{\m_{2\to 1}}
= \m_{1\to 2}~ \frac{\bar f_2'}{\bar f_2}
\end{align}
The updates are written in several different forms to show alternatives for the implementation. Intuitively: the message $\n$ reflects the marginal of $f_1$, leaving out all information that has previously been exchanged between $f_1$ and $f_2$. $\m_{1\to 2}$ `summarizes' all messages that have so far been sent from $f_1$ to $f_2$. Alternative view: the quotients $f_1/\m_{2\to 1}$ and $f_2/\m_{1\to 2}$ that appear in these equations are the original (initial state) factors multiplied with all incoming messages except for the ones from $f_2$ and $f_1$, respectively,
\begin{align}
\frac{f_1}{\m_{2\to 1}} = f_1^\text{org}~ \prod_{l:\exists(1l),~ l\neq 2} \m_{l\to 1} ~.
\end{align}

Note that this message exchange preserves consistency because
\begin{align}
\frac{\bar f_1 ~ \bar f_2'}{\m_{1\to 2}'~ \m_{2\to 1}}
= \frac{\bar f_1~ \bar f_2}{\m_{1\to 2}~ \m_{2\to 1}}
\end{align}

Second, note that in equilibrium $\bar f_1 = \m_{1\to 2}~ \m_{2\to 1} = \bar f_2$ and thus $\n=1$ and the message exchange has no effect.

Approximate message exchange

[[ so far realized only for MoGs ]]

On factor graphs with mixtures of Gaussians (say, a Markov chain / switching state space model), the marginal of a factor usually contains more Gaussian modes than a message can represent (the number of modes would grow exponentially along a Markov chain). Hence, marginalization for a MoG is implemented approximately, by collapsing all Gaussians that correspond to the same mixture component.

Approximate message exchange (e.g., expectation propagation / assume density filtering)

Factors or messages are sometimes limited to representing only a certain class of distributions (e.g., Gaussians in the continuous case); then one might not be able to do an exact message exchange because $f_2~ \nu$ is not in this class of distributions. The approximate message exchange then is
\begin{align}
f_2' \approx f_2~ \nu ~,
\end{align}
where the approximation refers to some Assumed Density Filtering approach. In this case, the message factors $\m_{1\to 2}$ should be updated according to
\begin{align}
\m_{1\to 2}' = \m_{1\to 2}~ \frac{\bar f_2'}{\bar f_2} ~,
\end{align}
because only this update preserves consistency ($\m_{1\to 2}' = \m_{1\to 2}~ \nu$ would not, because of the approximation).

Mean field approximations

[[ todo ]]

Given a graph with only two factors $X_1$ and $X_2$. A mean field approximation would compute the current b

Mean field approximations take the

Particle filtering

[[ Nando's paper: Fast Particle Smoothing: If I had a million particles

How can you rearrange particles from the forward and backward filter? ]]

Think of the Markov chain $X \to A \to B \to C \to Y$ with arbitrary continuous transitions and observations $X$ and $Y$. $P(A\vert X)$ is represented as a particle population (forward particle filter), and $P(Y\vert C)$ is represented as a population (backward particle filter). I want
\begin{align}
P(B \vert X,Y)
&= \frac{1}{P(Y\vert X)} \sum_{A,C} P(Y\vert C)~ P(C\vert B)~ P(B\vert A)~ P(A\vert X) \feed
&\approx \frac{1}{P(Y\vert X)} \sum_{i,j}
w^C_i~ P(x^C_i\vert B)~ P(B\vert x^A_j)~ w^A_j
\end{align}
where $P(A\vert X) \approx \sum_j w^A_j~ \d_{x^A_j}(A)$ and $P(Y\vert C) \approx \sum_i w^C_i~ \d_{x^C_i}(C)$.

Factors

Generic operations and testing

To implement inference algorithms, only a few basic operations need to be provided to manipulate factors ($f$'s and $g$'s). These are
\begin{align*}
&\text{multiply}(f,g) && f(\vec X) \gets f(\vec X)~ g(\vec X) \\
&\text{divide}(f,g) && f(\vec X) \gets f(\vec X) ~/~ g(\vec X) \\
&\text{marginal}(f,\pi) && g(\vec X_\pi) \gets \sum_{\vec X \setminus \vec X_\pi} f(\vec X) \\
&\text{subMultiply}(f,g,\pi) && f(\vec X) \gets f(\vec X)~ g(\vec X_\pi)
\end{align*}
and for practical reasons also some more trivial methods
\begin{align*}
&f\text{ = }g && f(\vec X) \gets g(\vec X) \\
&\text{uniform}(f) && f(\vec X) \gets 1 \\
&\text{sample}(f) && \text{returns a sample of }f(\vec X)\\
&f\text{ == }g && \text{boolean for $f=g$}
\end{align*}

In the implementation, a class factor will define these methods as virtuals. Factor implementations derived from this base class implement these methods.
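A minimal table-factor sketch of these operations (illustrative Python; the library itself defines them as C++ virtuals). Here `sub_multiply` corresponds to subMultiply, and `marginal` returns a new factor over a subset of variables:

```python
import itertools

class TableFactor:
    """Minimal table factor over discrete variables; var_ids is a tuple of
    variable ids, dims maps variable id -> cardinality."""

    def __init__(self, var_ids, dims, values=None):
        self.vars = tuple(var_ids)
        self.dims = dict(dims)
        keys = itertools.product(*(range(self.dims[v]) for v in self.vars))
        self.t = {k: 1.0 for k in keys} if values is None else dict(values)

    def _sub(self, key, sub_vars):
        """Project a full assignment onto a subset of this factor's variables."""
        return tuple(key[self.vars.index(v)] for v in sub_vars)

    def multiply(self, g):              # f(X) <- f(X) g(X), same variable order
        for k in self.t:
            self.t[k] *= g.t[k]

    def divide(self, g):                # f(X) <- f(X) / g(X)
        for k in self.t:
            self.t[k] /= g.t[k]

    def marginal(self, sub_vars):       # g(X_pi) <- sum over the other variables
        out = TableFactor(sub_vars, self.dims)
        for k in out.t:
            out.t[k] = 0.0
        for k, v in self.t.items():
            out.t[self._sub(k, sub_vars)] += v
        return out

    def sub_multiply(self, g):          # f(X) <- f(X) g(X_pi), g over a subset
        for k in self.t:
            self.t[k] *= g.t[self._sub(k, g.vars)]

# usage: f(X0,X1) uniform, multiplied by g(X1), then marginalized back to X1
dims = {0: 2, 1: 3}
f = TableFactor((0, 1), dims)
g = TableFactor((1,), dims, {(j,): float(j + 1) for j in range(3)})
f.sub_multiply(g)
m = f.marginal((1,))
assert m.t == {(0,): 2.0, (1,): 4.0, (2,): 6.0}
```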

A robust way to check the validity of a factor implementation is to define tests in the spirit of unit testing. We define that a valid factor has to pass the following randomized tests:

Let $\vec X_k$ and $\vec X_l$ be two cliques of variables, each with a random number $n$ of variables, each variable with random dimensionality/cardinality, and sharing the random variables $\vec X_{kl}$.


\begin{test}[multiply, divide]
\item $f(X_k)$\ = random,~ $h(X_k)$\ = random
\item assert~ $f*h=h*f$\ ~ and ~ $(f * h)/h = f$\ ~ and ~ $(f*h)/f = h$
\end{test}


\begin{test}[message passing]
\item $f(X_k)$\ = random,~ $h(X_l)$\ = random
\item exchange messages fwd and bwd between $f$\ and $h$
\item assert $f$\ and $h$\ are in equilibrium
\end{test}


\begin{test}[marginal]
\item $f(X_k)$\ = random,~ $h(X_{kl})$\ = random
\item $g$\ = marginal$(f,\pi_{kl})$
\item $f'$\ = subMultiply$(f,h,\pi_{kl})$
\item $g'$\ = marginal$(f',\pi_{kl})$
\item assert $h*g=g'$
\end{test}

Some factor implementations will pass these tests only approximately. E.g., a mixture of Gaussians can pass test 1 and 2, but not exactly test 3.
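For plain table factors, tests 1 and 3 can be instantiated directly. A sketch with hand-rolled dict tables (cliques $X_k=(A,B)$, $X_l=(B,C)$, junction $B$, all binary; this mirrors the tests above, not the library's test harness):

```python
import random

random.seed(0)

def rand_table(keys):
    return {k: random.uniform(0.1, 1.0) for k in keys}

keys2 = [(i, j) for i in (0, 1) for j in (0, 1)]

# test 1: multiply / divide
f = rand_table(keys2)
h = rand_table(keys2)
fh = {k: f[k] * h[k] for k in keys2}
assert all(abs(fh[k] / h[k] - f[k]) < 1e-12 for k in keys2)
assert all(abs(fh[k] / f[k] - h[k]) < 1e-12 for k in keys2)

# test 3: f(A,B), h(B); g = marginal(f, B); f' = subMultiply(f, h, B);
# g' = marginal(f', B); then h*g must equal g'
h1 = {b: random.uniform(0.1, 1.0) for b in (0, 1)}
g = {b: f[(0, b)] + f[(1, b)] for b in (0, 1)}
f2 = {(a, b): f[(a, b)] * h1[b] for a, b in keys2}
g2 = {b: f2[(0, b)] + f2[(1, b)] for b in (0, 1)}
assert all(abs(h1[b] * g[b] - g2[b]) < 1e-12 for b in (0, 1))
```

For tables these identities hold to machine precision; for MoGs, test 3 holds only approximately because of the collapse in marginalization.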

Factor types

Tables

$\vec X$ is discrete and $f(\vec X)$ is stored as a table. The operations are conceptually straightforward to implement.

In practice, to store a table factor one should always store a table $p(\vec X)$ scaled to sum to 1 and additionally a log scaling $l$ such that
\begin{align*}
f(\vec X) = \exp(l)~ p(\vec X) ~.
\end{align*}
In that way, over- and underflow in the table is absorbed in the scaling $l$. (In HMMs, e.g., $l$ will end up being the data log-likelihood.)
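A small demonstration of this bookkeeping (illustrative; think of multiplying many small per-step likelihood factors, as in an HMM forward pass):

```python
import math

def normalize(table):
    """Return (p, l) with p summing to 1 and l the log of the total mass,
    so that table(X) = exp(l) * p(X)."""
    s = sum(table.values())
    return {k: v / s for k, v in table.items()}, math.log(s)

p = {0: 0.5, 1: 0.5}
log_scale = 0.0
for _ in range(1000):
    raw = {k: p[k] * 1e-3 for k in p}   # a tiny per-step likelihood factor
    p, l = normalize(raw)
    log_scale += l

# the normalized table never under/overflows; the mass lives in log_scale
assert abs(log_scale - 1000 * math.log(1e-3)) < 1e-9
# a naive running product 0.5 * (1e-3)**1000 would have underflowed to 0.0
```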

Gaussians

$\vec X$ is a continuous variable and $f(\vec X) = \NN_{\vec X}(a,A)$. Product, division and marginal product are best implemented in the canonical representation $\NN[\bar a,\bar A]$. Only the marginal extraction requires matrix inversions, namely Schur complements: the marginal precision matrix for $x_a$, given a joint precision matrix $[A,C;C^T,B]$, is $A' = A - C B^{-1} C^T$. See the appendix for details.
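For two scalar variables, the Schur complement can be checked by hand against the explicit 2x2 inverse (a sketch, all quantities scalars):

```python
# joint precision matrix [[A, C], [C, B]] over two scalar variables
A, C, B = 2.0, 0.5, 1.5

# marginal precision of the first variable via the Schur complement
A_marg = A - C * (1.0 / B) * C

# cross-check: the joint covariance is the inverse of the precision matrix,
# and the marginal variance is its (0,0) entry
det = A * B - C * C
cov00 = B / det                    # (0,0) entry of the 2x2 inverse
assert abs(1.0 / A_marg - cov00) < 1e-12
```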

Again, in addition to the $\bar a$ and $\bar A$ we store a scaling $l$ such that
\begin{align*}
f(\vec X) = \exp(l)~ \NN_{\vec X}[\bar a,\bar A] ~.
\end{align*}

Mixture of Gaussians

[[implemented]]

Unscented Transform Gaussian factor

Two variables $X$ and $Y$, Gaussian over $X$, a generic (non-linear) but invertible(!) function from $X$ to $Y$. Using the UT to map from $X$ to $Y$ and back.

[[implemented]]

Linearized Function Gaussian factor

Two variables $X$ and $Y$, Gaussian over $X$, a generic (non-linear, perhaps redundant) function from $X$ to $Y$ which provides a local linearization (the Jacobian). The fwd mapping from $X$ to $Y$ is trivial using the linearization around the mean of $X$. The backward mapping is tricky in the case of redundancy:

  1. Linearize around the mean of $X$ (there must be some meaningful prior on $X$, otherwise it doesn't work!),
  2. redundantly pull back the Gaussian on $Y$, blowing up variance along the null space,
  3. execute multiplication with $X$,
  4. relinearize the function at the new mean, and repeat from step 2.

[[implemented]]

Switch factors for relational/logic models

Consider $n$ random variables $X_{1:n}$ of same type (in the same space) and a discrete random variable $i \in \{1,..,n\}$. Given $X_{1:n}$ and $i$, we have another random variable $Y$ which depends only on $X_i$:
\begin{align*}
Y \sim f(X_i) ~.
\end{align*}
In general terms, $Y$ depends on all of $X_{1:n}$ and $i$, and we have $Y \sim P(Y \mid X_{1:n},i)$. The definition above, though, captures much more structure in this dependency.

We introduce switch factors $f(X_{1:n},i,X)$ to `copy' the random variable $X_i$ into an additional buffer random variable $X$. That way we can then introduce a simple factor for $P(Y\vert X)$.

A switch factor is of the form
\begin{align*}
f(X_{1:n},i,X) = \left\{
\arr{cc}{g(i)~ h(X_{1:n}) & X=X_i \\ 0 & \text{otherwise}}
\right. ~,
\end{align*}
where $g$ and $h$ store marginal factors over $i$ and $X_{1:n}$ respectively.

It should be possible to implement all core methods.

[[work in progress]]

Decision tree factor

[[not approached yet]]

Particle factors

[[ideas, but not approached yet]]

Appendix: Gaussians

Definitions

 
In normal form
\begin{align}
&\NN_x(a,A) = \NN_a(x,A)\feed
&\quad= \frac{1}{\vert 2\pi A\vert^{1/2}}~ \exp -\half\{(x-a)^T A^{-1} (x-a)\}
= \frac{1}{\vert 2\pi A\vert^{1/2}}~ \exp -\half\{x^T A^{-1} x - 2 x^T A^{-1} a + a^T A^{-1} a\}
\end{align}
In canonical form (with $B = A^{-1},~ b = A^{-1}a,~ a = B^{-1} b$)
\begin{align}
&\NN_x[b,B]
= \NN_x(B^{-1} b,B^{-1})
= \frac{\exp-\half\{b^T B^{-1} b\}}{\vert 2\pi B^{-1}\vert^{1/2}}~ \exp
-\half\{x^T~ B~ x - 2 x^T b\}
\end{align}
Non-normalized (e.g., for propagating likelihoods)
\begin{align}
&\oNN_x(a,A)
= \vert 2\pi A\vert^{1/2}~ \NN_x(a,A)
= \exp -\half\{(x-a)^T~ A^{-1}~ (x-a)\}
\end{align}

Matrices


\begin{align}
&(A^{-1} + B^{-1})^{-1} = A~ (A+B)^{-1}~ B
\comma
(A^{-1} - B^{-1})^{-1} = A~ (B-A)^{-1}~ B \\
&\del_x \vert A_x\vert = \vert A_x\vert~ \tr(A_x^{-1}~ \del_x A_x) \comma
\del_x A_x^{-1} = - A_x^{-1}~ (\del_x A_x)~ A_x^{-1}
\end{align}

Derivatives


\begin{align}
& \del_x \NN_x(a,A) = -\NN_x(a,A)~ h \comma h:= A^{-1}(x-a)\\
& \del_\t \log \NN_x(a,A)
= -\half \tr(A^{-1}~ \del_\t A)
+ h^T (\del_\t a)
+ \frac{1}{2} h^T (\del_\t A) h
\end{align}

Product


\begin{align}
\NN_x&(a,A) \cdot \NN_x(b,B) = \NN_x(c,C) \cdot \NN_a(b,A+B) \feed
&\quad C = (A^{-1}+B^{-1})^{-1} \comma c = C~(A^{-1} a + B^{-1} b) \\
\NN_x&[a,A] \cdot \NN_x[b,B]
= \NN_x[a+b,A+B] \cdot \NN_{A^{-1}a}[A(A+B)^{-1}b,A(A+B)^{-1}B]
\end{align}
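In one dimension the first product identity is easy to verify numerically, with $C = (1/A + 1/B)^{-1}$ and $c = C\,(a/A + b/B)$:

```python
import math

def N(x, mean, var):
    """Scalar Gaussian density N(x | mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

a, A = 1.0, 2.0
b, B = -0.5, 0.5
C = 1.0 / (1.0 / A + 1.0 / B)
c = C * (a / A + b / B)

# N(x|a,A) N(x|b,B) = N(x|c,C) N(a|b,A+B) pointwise in x
for x in (-1.0, 0.0, 0.3, 2.0):
    assert abs(N(x, a, A) * N(x, b, B) - N(x, c, C) * N(a, b, A + B)) < 1e-12
```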

Division


\begin{align}
\NN_x&(a,A) ~\big/~ \NN_x(b,B) = \NN_x(c,C) ~\big/~ \NN_c(b,C+B) \feed
&\quad c = C~(A^{-1} a - B^{-1} b) \comma C^{-1} = A^{-1} - B^{-1} \\
\NN_x&[a,A] ~\big/~ \NN_x[b,B] \propto \NN_x[a-b,A-B]
\end{align}

Transformation


\begin{align}
& x' = F x + f \\
& \NN_x(a,A) = \vert F\vert~ \NN_{Fx+f}(Fa+f,\, F A F^T) \\
& \text{(for } x' = F x + f) \comma P(x') = \frac{1}{\vert F\vert}~ P(x=F^{-1}(x' - f))
\end{align}

block-matrices


\begin{align}
&\left\vert\arr{cc}{A&C\\ D&B}\right\vert = \vert A\vert~ \vert\bar B\vert = \vert B\vert~ \vert\bar A\vert
\comma \bar A := A - C B^{-1} D \comma \bar B := B - D A^{-1} C \\
&\left[\arr{cc}{A&C\\ D&B}\right]^{-1}
= \left[\arr{cc}{\bar A^{-1}&-\bar A^{-1} C B^{-1}\\ -B^{-1} D \bar A^{-1}&\bar B^{-1}}\right]
\end{align}

marginal & conditional:

 
We write $\{a,A\}$ to abbreviate $\NN_x(a,A)$ and $[a,A]$ to abbreviate $\NN_x[a,A]$. We give decompositions $P(y,x) = P(x) \cdot
P(y\vert x)$. Equations are true in the limit $\l\to \infty I$.
\begin{align}
P(x) = \{a,\, A\} \comma P(y\vert x) = \{Fx+f,\, Q\}
\quad\Rightarrow\quad
\left[\arr{c}{x\\ y}\right] \sim \{ \left[\arr{c}{a\\ Fa+f}\right] ,~ \left[\arr{cc}{A&A F^T\\ FA&Q+F A F^T}\right]\}
\end{align}

\begin{align}
P(x) = [a,\, A] \comma P(y\vert x) = \NN_y[h + H x,\, B]
\quad\Rightarrow\quad
\left[\arr{c}{x\\ y}\right] \sim [ \left[\arr{c}{a - H^T B^{-1} h\\ h}\right] ,~
\left[\arr{cc}{A+H^T B^{-1} H&-H^T\\ - H&B}\right]]
\end{align}

Entropy


\begin{align}
H(\NN(a,A)) &= \frac{1}{2} \log \vert 2\pi e A\vert
\end{align}

Kullback-Leibler divergence


\begin{align}
&p=\NN_x(a,A) \comma q=\NN_x(b,B) \comma n = \text{dim}(x)
\feed
&2~ \kld{p}{q}
= \log\frac{\vert B\vert}{\vert A\vert} + \tr(B^{-1}A) + (b-a)^T B^{-1} (b-a) - n \\
&2~ [\kld{p}{q} + \kld{q}{p}]
= \tr(B^{-1}A) + \tr(A^{-1}B) + (b-a)^T (A^{-1}+B^{-1}) (b-a) - 2n
\end{align}

$\l $-divergence
\begin{align}
2~ \kld[\l ]{p}{q}
&= \l ~ \kld{p}{\l p+(1\!-\!\l )q} ~+~ (1\!-\!\l )~ \kld{q}{\l p + (1\!-\!\l ) q}
\end{align}

For $\l =.5$: Jensen-Shannon divergence.

Log-likelihoods


\begin{align}
\log \NN_x(a,A)
&= - \half \log\vert 2\pi A\vert - \half (x-a)^T~ A^{-1}~ (x-a) \\
\int_x \NN_x(b,B)~ \log \NN_x(a,A)
&= -\kld{\NN(b,B)}{\NN(a,A)} - H(\NN(b,B))
\end{align}

Mixture of Gaussians

 
Collapsing a MoG into a single Gaussian
\begin{align}
&\argmin{b,B} \kld{\sum_i p_i~ \NN_x(a_i,A_i)}{\NN(b,B)} ~:
b=\sum_i p_i a_i \comma
B=\sum_i p_i (A_i + a_i a_i^T - b\, b^T)
\end{align}
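In one dimension the collapse is just moment matching; a quick numeric check of the formulas:

```python
# a mixture sum_i p_i N(a_i, A_i) in one dimension, moment-matched by N(b, B)
p = [0.3, 0.7]
a = [0.0, 2.0]
A = [1.0, 0.5]

b = sum(pi * ai for pi, ai in zip(p, a))
B = sum(pi * (Ai + ai * ai - b * b) for pi, ai, Ai in zip(p, a, A))

# cross-check against the direct first and second moments of the mixture
m1 = sum(pi * ai for pi, ai in zip(p, a))
m2 = sum(pi * (Ai + ai * ai) for pi, ai, Ai in zip(p, a, A))
assert abs(b - m1) < 1e-12
assert abs(B - (m2 - m1 * m1)) < 1e-12
```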
Marginal
\begin{align}
&P(x,y) = \sum_i p_i~ \NN_{x,y}(\left[\arr{c}{a_i\\ b_i}\right],
\left[\arr{cc}{A_i&C_i\\ C_i^T&B_i}\right]) \feed
&F = \sum_i p_i~ B_i^{-1} C_i^T \comma
f = \sum_i p_i~ B_i^{-1} b_i \comma
Q = ?
\end{align}

Kalman filter (fwd) equations


\begin{align}
& x'\vert x \sim \{F~x+f,~ Q\} \comma y\vert x \sim \{C~x,~ R\} \comma \hat x := F~x+f \\
& x'\vert x,y' \sim \{F~x+f + (Q^{-1} + C^T R^{-1} C)^{-1} C^T R^{-1}
(y'-C \hat x),~ (Q^{-1} + C^T R^{-1} C)^{-1}\}
\end{align}
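For scalars, the information-form posterior above agrees with the familiar Kalman-gain form; a quick check (illustrative numbers):

```python
# transition x' = F x + f + noise(Q); observation y = C x' + noise(R)
F, f, Q = 1.0, 0.0, 0.5
C, R = 2.0, 0.1

x = 0.3                    # previous state (taken as known here)
y = 1.0                    # observation of x'
xh = F * x + f             # predicted mean

# information form (as in the equations above)
cov_i = 1.0 / (1.0 / Q + C * C / R)
mean_i = xh + cov_i * (C / R) * (y - C * xh)

# standard gain form: K = P C / (C P C + R)
K = Q * C / (C * Q * C + R)
mean_g = xh + K * (y - C * xh)
cov_g = (1.0 - K * C) * Q

assert abs(mean_i - mean_g) < 1e-12
assert abs(cov_i - cov_g) < 1e-12
```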

linear fwd-bwd equations without observations


\begin{align}
& \a_t(x) = \NN_x(a_t,A_t) = P(x \vert \text{start})
\comma \b_\tau(x) = \NN_x(b_\tau,B_\tau) \propto P(\text{end} \vert x) \\
& a_{t+1} = F~a_t + f \comma A_{t+1} = Q + F~A_t~F^T \\
& b_{\tau+1} = F^{-1}(b_\tau - f) \comma B_{\tau+1} = F^{-1}~(Q + B_\tau)~F^{-T}
\quad\text{(more precisely: $\b_{\tau+1} = \frac{1}{\vert F\vert}~ \NN(b_{\tau+1},B_{\tau+1})$)}
\end{align}

non-linear fwd-bwd equations without observations


\begin{align}
&P(x'\vert x) = \NN_{x'}(\phi(x),Q) \\
&\a_t(x) = \NN_x(a_t,A_t) \comma \b_\tau(x) = \NN_x(b_\tau,B_\tau) \\
&\text{posterior: } \NN_x(c_{t,\tau},C_{t,\tau})
\comma C_{t,\tau} = (A_t^{-1} + B_\tau^{-1})^{-1}
\comma c_{t,\tau} = C_{t,\tau}~ (A_t^{-1}~ a_t + B_\tau^{-1}~ b_\tau) ~.
\end{align}

mt 2007-07-18