LINEAR ESTIMATION
AND THE KALMAN FILTER
by Rick Martinelli, Haiku Laboratories, June 2008
Copyright © 2008
The purpose of this paper is to develop the equations of the linear Kalman filter for use in data analysis. Our primary interest is the smoothing and prediction of financial market data, and the Kalman filter is one of the most powerful tools for this purpose. It is a recursive algorithm that predicts future values of an item from the information in all of its past values. It is also a least-squares procedure, in that its prediction-error variance is minimized at each stage. Development of the equations proceeds in steps, starting with ordinary least-squares (OLS) estimation and moving through the Gauss-Markov estimate, minimum-variance estimation, and recursive estimation to the Kalman filter. Much of this development was taken from references [1] and [2].
We begin with the simplest linear model y = Ax, where y is an n × 1 vector of measurements y(k), x is an m × 1 vector of unknown parameter values x(j), and A is a known n × m model matrix. If n = m and A is invertible, the solution is given by x = A^{-1}y.
For least-squares we assume the model is over-determined, i.e., there are more measurements than parameters to be determined, so n > m. Then A is not square and has no inverse. Instead we seek the least-squares estimate x̂ of x as the minimizer of the length of the residual vector
e = y – Ax.
The squared length of e is the sum of the squares of the n separate residuals, and is denoted
|e|^{2} = (y – Ax)^{T}(y – Ax),
where A^{T} is the transpose of A. Minimizing this quadratic with respect to x yields the normal equations
A^{T}A x̂ = A^{T}y.
The matrix A^{T}A is symmetric and, provided A has full column rank, positive definite, so it has an inverse. Hence we can write
x̂ = (A^{T}A)^{-1}A^{T}y.
This is the OLS estimate of the parameter vector x. If we define the m × n matrix L_{O} = (A^{T}A)^{-1}A^{T}, then we have the simple expression
x̂ = L_{O}y.
From y = Ax we find the OLS estimate of y to be
ŷ = A x̂ = AL_{O}y = A(A^{T}A)^{-1}A^{T}y = Sy.
This result has a nice geometric interpretation: the n × n matrix S is a projection matrix, and ŷ is the orthogonal projection of y onto the subspace spanned by the columns of A. Projections also yield very effective methods for smoothing data (see [3]).
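As a numerical sketch of the OLS estimate and the projection matrix S, here is a NumPy fragment; the model matrix, true parameters and noise level are invented for illustration and are not from the text:

```python
import numpy as np

# Hypothetical over-determined system: n = 5 measurements, m = 2 parameters.
rng = np.random.default_rng(0)
A = np.column_stack([np.ones(5), np.arange(5.0)])   # n x m model matrix
x_true = np.array([1.0, 2.0])
y = A @ x_true + rng.normal(scale=0.1, size=5)      # noisy measurements

# OLS estimator L_O = (A^T A)^{-1} A^T and estimate x_hat = L_O y
L_O = np.linalg.inv(A.T @ A) @ A.T
x_hat = L_O @ y

# Projection matrix S = A L_O; y_hat = S y is the orthogonal projection
# of y onto the column space of A.
S = A @ L_O
y_hat = S @ y
```

Note that S @ S equals S, the defining property of a projection.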
We next consider weighted least-squares (WLS), where weights are assigned to residuals according to the accuracy of the corresponding measurements. For example, if y(1) is considered more accurate than y(2), we would like to make e(1) smaller than e(2). Thus we replace the sum of squares of residuals |e|^{2} with the weighted sum
|We|^{2} = (Wy – WAx)^{T}(Wy – WAx),
where W is an n × n matrix of weights. If the measurements are independent, W is diagonal; if they are correlated, W has off-diagonal elements. In all cases W is symmetric and positive definite. For WLS the normal equations become
(WA)^{T}WA x̂ = (WA)^{T}Wy,
and again we can take the inverse and write
x̂ = ((WA)^{T}WA)^{-1}(WA)^{T}Wy.
Since (WA)^{T} = A^{T}W^{T}, this is
x̂ = (A^{T}W^{T}WA)^{-1}A^{T}W^{T}Wy.
Letting C = W^{T}W, we finally have the WLS estimate
x̂ = L_{W}y = (A^{T}CA)^{-1}A^{T}Cy.
Note that OLS corresponds to a weight matrix W equal to the identity, that is, all measurement weights are equal to one, in which case L_{W} reduces to L_{O}.
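A small WLS sketch in NumPy, assuming a hypothetical heteroscedastic setup in which the later measurements are noisier (all numbers are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
A = np.column_stack([np.ones(6), np.arange(6.0)])
x_true = np.array([0.5, -1.0])
# Hypothetical noise levels: later measurements are five times noisier.
sigmas = np.array([0.1, 0.1, 0.1, 0.5, 0.5, 0.5])
y = A @ x_true + rng.normal(scale=sigmas)

W = np.diag(1.0 / sigmas)      # weight accurate measurements more heavily
C = W.T @ W
L_W = np.linalg.inv(A.T @ C @ A) @ A.T @ C
x_hat = L_W @ y
```

Weighting by the reciprocal noise level makes each weighted residual comparable in scale, which is exactly the Gauss-Markov choice C = R^{-1} derived in the next section.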
MINIMUM VARIANCE UNBIASED ESTIMATOR (MVU)
To determine the optimum choice for the matrix C in the WLS estimate we need to add some statistical properties to the model. We assume the measurement vector y, and hence the residual vector e = y – Ax, are random vectors. One required statistical property is that the residuals are unbiased, that is, equally scattered around the estimate. This is expressed by E[e(k)] = 0 for each k, where E is the expected-value operator.
Each residual also has a variance, denoted V_{k} = E[e(k)^{2}], and any correlations between residuals show up as covariances V_{km} = V_{mk} = E[e(k)e(m)]. These are collected in the covariance matrix of e, defined by
R = E[ee^{T}].
The n × n matrix R has the variances V_{k} on its diagonal and the covariances V_{km} = V_{mk} as off-diagonal elements, and is positive-semidefinite. Matrix R will be used to calculate C, after we define another required statistical property.
The second required property is that the linear estimator L is also unbiased, in the sense that the expected error in the estimate, x – x̂, is zero. Although the true x values are unknown, we may still write
0 = E[x – x̂] = E[x – Ly] = E[x – LAx – Le] = E[(I – LA)x].
Hence L is unbiased if LA = I, the m × m identity matrix. Here we've used the fact that
E[Le] = LE[e] = 0,
since both E and L are linear. The estimate errors also have a covariance matrix, given by
P = E[(x – x̂)(x – x̂)^{T}].
P is m × m and positive-semidefinite. Matrix P is called the fundamental matrix of filtering theory. Minimizing the trace of this matrix provides the optimum choice for C mentioned above. The trace of P may be written
E|x – x̂|^{2} = E[(x – x̂)^{T}(x – x̂)].
It is this quantity we seek to minimize for MV estimation. The solution to this optimization problem was found long ago to be C = R^{-1}, and the result is known as the Gauss-Markov theorem:
Given y = Ax + e, with E[ee^{T}] = R, the minimum-variance unbiased (MVU) estimator of x given y is
x̂ = L_{G}y = (A^{T}R^{-1}A)^{-1}A^{T}R^{-1}y,
with error covariance
P = (A^{T}R^{-1}A)^{-1}.
It is easily seen that L_{G}A = I, showing that L_{G} is unbiased. The general WLS estimator
L_{W} = (A^{T}CA)^{-1}A^{T}C
is also unbiased, since L_{W}A = I again. This includes OLS, where C = R = I, meaning that in OLS the residuals are tacitly assumed to be uncorrelated and to have variance 1.
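The Gauss-Markov estimator can be sketched in NumPy as follows; the model matrix and the correlated residual covariance R are invented for illustration:

```python
import numpy as np

A = np.column_stack([np.ones(4), np.arange(4.0)])
# Hypothetical residual covariance: variance 0.04 on the diagonal,
# covariance 0.01 between neighbouring residuals.
R = 0.04 * np.eye(4) + 0.01 * np.eye(4, k=1) + 0.01 * np.eye(4, k=-1)
Rinv = np.linalg.inv(R)

P = np.linalg.inv(A.T @ Rinv @ A)   # error covariance of the estimate
L_G = P @ A.T @ Rinv                # Gauss-Markov (MVU) estimator
```

Applying L_G to a measurement vector y gives the MVU estimate x̂ = L_G y, and L_G A = I confirms it is unbiased.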
MINIMUM VARIANCE ESTIMATES (MV)
The previous estimates have assumed that x is a completely unknown parameter vector, capable of assuming any value. In many situations, however, there is prior statistical information about x which may limit its possible values. This a priori information may be exploited to produce an estimate, the minimum-variance (MV) estimate, with smaller error variance than the MVU estimate.
We adopt the model y = Ax + e, where now x, y and e are all random vectors, and A is a known, constant matrix. As before we assume E[ee^{T}] = R. We now assume in addition that E[xx^{T}] = Q, where Q is positive-semidefinite, and that x and e are uncorrelated, i.e., E[ex^{T}] = 0, the zero matrix.
Under these conditions we have the Minimum-Variance Estimation theorem:
Given y = Ax + e, with E[xx^{T}] = Q, E[ee^{T}] = R and E[ex^{T}] = 0, the linear estimate that minimizes the error variance E|x – x̂|^{2} is
x̂ = L_{M}y = (A^{T}R^{-1}A + Q^{-1})^{-1}A^{T}R^{-1}y,
with error covariance
P = (A^{T}R^{-1}A + Q^{-1})^{-1}.
L_{M} is the linear minimum-variance estimator of x given y. Notice that if Q^{-1} = 0, corresponding to infinite variance of the a priori information, L_{M} reduces to L_{G}, the MVU estimator.
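The claim that the prior shrinks the error covariance can be checked numerically; in this sketch the matrices A, R and Q are hypothetical values chosen only to compare the two covariances:

```python
import numpy as np

A = np.column_stack([np.ones(3), np.arange(3.0)])
R = 0.09 * np.eye(3)          # measurement-noise covariance (hypothetical)
Q = np.diag([1.0, 1.0])       # a priori covariance of x (hypothetical)

Rinv = np.linalg.inv(R)
P_mvu = np.linalg.inv(A.T @ Rinv @ A)                      # Gauss-Markov
P_mv = np.linalg.inv(A.T @ Rinv @ A + np.linalg.inv(Q))    # minimum variance
```

Adding the positive-definite term Q^{-1} before inverting can only shrink the result, so P_mv is smaller than P_mvu in the matrix sense.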
RECURSIVE ESTIMATION
Suppose that an MV estimate x̂_{0} of x has been found based on previous measurements and has error covariance P_{0}. When new measurements become available we want an update x̂_{1} of x̂_{0} based on the new information in the measurements. Assume the new measurements are of the form y = Ax + e used above. The best estimate of the new measurements y, based on the past measurements only, is ŷ = A x̂_{0}. The error in this estimate, y – ŷ, is called the innovation. It represents that part of the information in y that is orthogonal to A x̂_{0}, i.e., could not be derived from x̂_{0} by a linear operation. The estimate x̂_{1} is given by the Recursive Estimation theorem:
If x̂_{0} is the MV estimate of x based on some previous measurements and has error covariance P_{0}, and if new measurements y become available, the updated MV estimate x̂_{1} of x based on all the information is given by
x̂_{1} = x̂_{0} + L_{M}(y – A x̂_{0}), where L_{M} = P_{0}A^{T}(AP_{0}A^{T} + R)^{-1},
with error covariance
P_{1} = P_{0} – P_{0}A^{T}(AP_{0}A^{T} + R)^{-1}AP_{0}.
Notice that the old estimate x̂_{0} is updated by the action of L_{M} on the innovation y – A x̂_{0}, and that the error covariance decreases with the addition of the new information in y.
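A single recursive update step can be sketched as follows, assuming a hypothetical prior estimate, prior covariance and pair of new measurements (all numbers invented for illustration):

```python
import numpy as np

x0 = np.array([1.0, 0.0])                 # prior MV estimate (hypothetical)
P0 = np.eye(2)                            # its error covariance
A = np.array([[1.0, 0.0], [1.0, 1.0]])    # new-measurement model
R = 0.25 * np.eye(2)                      # new-measurement noise covariance
y = np.array([1.2, 1.9])                  # new measurements

L_M = P0 @ A.T @ np.linalg.inv(A @ P0 @ A.T + R)   # gain
innovation = y - A @ x0                             # part of y not predicted
x1 = x0 + L_M @ innovation                          # updated estimate
P1 = P0 - L_M @ A @ P0                              # updated covariance
```

The trace of P1 is strictly smaller than that of P0, reflecting the information gained from y.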
The recursive MV estimate described above assumes the parameter vector x is random with a fixed but unknown mean value. However, many situations arise in which the parameter vector changes with time, e.g., the position of a moving vehicle, or the condition of the stock market. Thus we speak of the state of the parameter vector, or simply the state vector of the system, and indicate its value at time k by x_{k}. In addition we assume the state vector is updated by the state model, consisting of a linear update plus model noise u_{k}:
x_{k+1} = Fx_{k} + u_{k},
where F is a known m × m matrix. The measurements satisfy the measurement model
y_{k} = Ax_{k} + e_{k},
where we assume E[e_{k}] = 0 and E[e_{k}e_{k}^{T}] = R_{k} for each k, with R_{k} positive-definite. The sequence e_{k} is called measurement noise. Taken together these two equations and their noise statistics constitute a discrete dynamical system.
A discrete random process is a sequence of random variables. The sequences y_{k}, e_{k}, x_{k} and u_{k} are all discrete random processes, and three estimation problems arise for them:
1. prediction – estimation of future values of the process from
previous measurements
2. filtering - estimation of the current value of the process from
previous inexact measurements up to the present
3. smoothing - estimation of past values of the process from previous
inexact measurements up to the present
This article deals with the first two of these; for smoothing applications see [3].
THE KALMAN FILTER
The Kalman filter provides a linear, minimum-variance, recursive estimation procedure based on the ideas in the previous sections. The model for the Kalman filter is a discrete dynamical system
x_{k} = Fx_{k-1} + u_{k-1}
y_{k} = Ax_{k} + e_{k}
where F and A are known matrices, E[e_{k}] = 0, E[e_{k}e_{k}^{T}] = R_{k}, E[u_{k}] = 0 and E[u_{k}u_{k}^{T}] = Q_{k}, for k = 0, 1, 2, …. Assume also an initial estimate x̂_{0} of x_{0} with error covariance P_{0} = E[(x_{0} – x̂_{0})(x_{0} – x̂_{0})^{T}]. The notation x̂_{k|k-1} will indicate the estimate of x_{k} from observations up to time k-1, and similarly for P_{k|k-1}. The Kalman filter theorem is:
The linear MV estimate x̂_{k|k-1} of the state vector x_{k} based on the previous k-1 measurements is updated according to the filter extrapolation equations:
x̂_{k|k-1} = F x̂_{k-1|k-1}
P_{k|k-1} = FP_{k-1|k-1}F^{T} + Q_{k-1}.
Initial conditions are x̂_{0|-1} = x̂_{0} and P_{0|-1} = P_{0}. In addition to these we have the filter update equations:
x̂_{k|k} = x̂_{k|k-1} + G_{k}(y_{k} – A x̂_{k|k-1})
P_{k|k} = P_{k|k-1} – G_{k}AP_{k|k-1},
where the filter gain is given by
G_{k} = P_{k|k-1}A^{T}(AP_{k|k-1}A^{T} + R_{k})^{-1}.
The filter equations may be generalized by allowing the
matrices A and F to also change with time, i.e., A becomes A_{k} and F
becomes F_{k}. They may also be
simplified by requiring the measurements or the state or both to be
scalars. A system in which the noise
variances are estimated from the data is called an adaptive filter. Ultimately,
the Extended Kalman Filter allows
non-linear functions F(x_{k},u_{k}) and A(x_{k},e_{k})
to be used.
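A minimal sketch of the full extrapolation/update cycle, assuming a hypothetical 1-D constant-velocity tracking problem; F, A, Q, R, the noise levels and the run length are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
F = np.array([[1.0, 1.0], [0.0, 1.0]])   # state transition: x_k = (pos, vel)
A = np.array([[1.0, 0.0]])               # measure position only
Q = 0.01 * np.eye(2)                     # model-noise covariance
R = np.array([[0.25]])                   # measurement-noise covariance

x_true = np.array([0.0, 1.0])            # true state (hypothetical)
x_hat = np.array([0.0, 0.0])             # initial estimate
P = np.eye(2)                            # initial error covariance

for _ in range(50):
    # Simulate the system one step and take a noisy measurement.
    x_true = F @ x_true + rng.multivariate_normal(np.zeros(2), Q)
    y = A @ x_true + rng.normal(scale=0.5, size=1)

    # Extrapolation equations.
    x_hat = F @ x_hat
    P = F @ P @ F.T + Q

    # Update equations with gain G_k.
    G = P @ A.T @ np.linalg.inv(A @ P @ A.T + R)
    x_hat = x_hat + G @ (y - A @ x_hat)
    P = P - G @ A @ P
```

After a few iterations P settles toward a steady value, and x_hat tracks both the measured position and the unmeasured velocity.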
SUMMARY
The following table is intended to summarize the estimates
developed here in a way that shows how each is either a generalization of, or a
restriction on, one of the previous estimates.
Ordinary least-squares          x̂ = (A^{T}A)^{-1}A^{T}y
Weighted least-squares          x̂ = (A^{T}CA)^{-1}A^{T}Cy
Gauss-Markov                    x̂ = (A^{T}R^{-1}A)^{-1}A^{T}R^{-1}y
Minimum variance                x̂ = (A^{T}R^{-1}A + Q^{-1})^{-1}A^{T}R^{-1}y
Recursive minimum variance      x̂_{1} = x̂_{0} + P_{0}A^{T}(AP_{0}A^{T} + R)^{-1}(y – A x̂_{0})
Kalman filter update            x̂_{k|k} = x̂_{k|k-1} + P_{k|k-1}A^{T}(AP_{k|k-1}A^{T} + R_{k})^{-1}(y_{k} – A x̂_{k|k-1})
REFERENCES
1. Luenberger, D.G., Optimization by Vector Space Methods, John Wiley and Sons, 1969.
2. Strang, G., Introduction to Applied Mathematics, Wellesley-Cambridge Press, 1986.
3. Martinelli, R., Data Smoothing by Vector Space Methods, Haiku Laboratories, 2006.