**Data-Smoothing By Vector Space Methods**

by Rick Martinelli, Haiku Laboratories, June 2003

Copyright ©, 2003

Updated November 2006

Introduction

Vector Spaces and Least Squares

Projection Matrix

An Example From The Stock Market

References

**INTRODUCTION**

Suppose a time-varying process x(t) is measured at
regular intervals, and it is known that the measurements are contaminated with
noise. If we let z(k) represent the measurement
at the kth interval, the situation may be represented by

(1)
z(k) = x(k) + y(k),  k = 1, 2,…, N,

where {x(k)} is called the *process*, {z(k)} is
called the *data*, {y(k)} are samples from a zero-mean random sequence
with fixed variance σ^{2}, and N is the number of
measurements. The *data‑smoothing problem* is to estimate the
process {x(k)} from the data {z(k)}. This is a centuries-old problem, first
addressed by the likes of Gauss and Legendre, who formulated the first
least-squares estimates.

Figure 1 shows a set of
simulated measurements (the data) as green circles, together with the underlying
process (blue line). The process is a uniform sampling of N=81 points of
the exponential function exp(‑t^{2}). The noise sequence is machine-generated random
numbers from a uniform distribution with mean 0 and variance 1. The noise was treated to remove low-frequency
components, as described below.

Figure 1. Simulated data (green circles) and the underlying process. Data points are connected by dashed lines.
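
For readers who want to reproduce the experiment, the following is a minimal sketch of how such data might be simulated. The t-interval [-2, 2] is an assumption (the memo does not state the sampling range), and the memo's extra step of removing low-frequency components from the noise is omitted:

```python
# Minimal sketch of the simulated data in Figure 1.
# Assumption: t runs over [-2, 2]; the memo does not state the range.
import numpy as np

N = 81
t = np.linspace(-2.0, 2.0, N)      # uniform sampling of N = 81 points
x = np.exp(-t**2)                  # the underlying process x(k)

rng = np.random.default_rng(0)
a = np.sqrt(3.0)                   # uniform on [-sqrt(3), sqrt(3)]: mean 0, variance 1
y = rng.uniform(-a, a, size=N)     # noise y(k); low-frequency removal omitted here
z = x + y                          # the data z(k) of equation (1)
```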

The problem is to retrieve a good approximation of the process from the noisy data. The usual approach is to impose some sort of smoothness condition on x, either by way of a process model, as in the case of least-squares estimates, or frequency restrictions as used in signal processing. More sophisticated approaches include a smoothness parameter that determines the trade-off between fidelity to a model and minimization of the error variance (see [1] for comparisons between approaches).

The technique described here is a modified basis-expansion method (see [1]) which combines features of the least-squares and signal-processing approaches. Generic modeling functions are chosen so that the process contains a small number of the slowly varying (low-frequency) components of the data, including the mean, and the noise contains all other components. Then the least-squares estimate of the process is calculated by applying a linear transformation, in the form of a *projection matrix*, to the data. The result is a very fast algorithm, in effect a lowpass filter, that may be used in real-time applications.

Figure 2. Smoothed estimates (circles)
and the underlying process.

Figure 2 shows the
results of this procedure applied to the data in Figure 1. Smoothed
estimates are shown as individual circles; the process is shown as a solid
line. The modeling functions used here are

(2)
{1,
Cos(πt), Cos(2πt), Cos(3πt), Cos(4πt)}

The noise is found by
subtracting the estimates from the data and is shown in Figure 3.

Figure 3.
Noise estimate: Data minus smoothed estimate.

In the next section we
provide an outline of the aspects of vector spaces and the least-squares method
needed to develop the projection matrix. In section 3, the projection
matrix is calculated and some of its properties described. In the last
section, the method is used to smooth real stock-market data to extract the
dominant frequency components. The vector space techniques used here may be
found in references [2] and [3]; there is nothing new here except possibly the
explicit calculation of the projection matrix from the modeling functions in
Theorems 1 and 2.

**VECTOR SPACES AND LEAST SQUARES**

The following is an outline of the machinery required to calculate the projection matrix. An *N-dimensional vector space* H is a set of N-tuples

v = (v(1), v(2),…, v(N))

where the v(k) are real numbers called the *coordinates*
of v. The zero vector in H is (0, 0,…, 0).

All vector spaces considered in this memo will be N-dimensional vector spaces for some fixed N. Such spaces are generalizations of the 2-dimensional plane where vectors are represented by arrows.

**Example**. Equation (1) in vector notation is

(3)
z = x + y.

Let S = {v_{1}, v_{2},…, v_{n}} be a set of vectors in H. A *linear combination* of the vectors in S is a sum of the form

a_{1}v_{1} + a_{2}v_{2} +…+ a_{n}v_{n}

where the a_{k} are real constants. The set of all such linear combinations is called the *subspace generated by S* and is denoted by [S].

**Independence**: A vector is *linearly dependent* on S if it is a member of [S], and *linearly independent* of S otherwise. The set S is called *linearly independent* if each of its members is linearly independent of the remaining members. A linearly independent set S is a *basis* for [S].

**Example**. The set S of polynomials

S = {V_{j}(k) = (k/N)^{j}, k = 1,…, N}

where j = 0,…, N-1, is a basis for H.

**Example**. The modeling functions in expression (2) above may be sampled to yield the set

S = {1, Cos(πk/N), Cos(2πk/N), Cos(3πk/N), Cos(4πk/N)}

which forms a basis for the 5-dimensional subspace [S] of H.

In particular, if the
set S contains N linearly independent vectors, then S is a *basis* for H,
and every vector in H is a linear combination of the basis vectors,

v = a_{1}v_{1} + a_{2}v_{2} +…+ a_{N}v_{N}

for some constants a_{k}.

**Vector Norm**: The notion of “vector length” can be generalized
from the 2-dimensional case by defining the (squared) *norm* of a vector v
= (v(1), v(2),…, v(N)) in H as

|v|^{2} = v(1)^{2} + v(2)^{2} +…+ v(N)^{2}.

This is called the *Euclidean*
norm, and H is called *Euclidean N-space*. A vector of norm one is
called a *unit* vector.

**Inner Product**: The notion of dot product between vectors
in two dimensions may be generalized to N dimensions by defining the *inner
product* of two vectors v and w in H as

<v, w> = v(1)w(1) + v(2)w(2) +…+ v(N)w(N).

The inner product is symmetric: <v, w> = <w, v>. When N = 2 it may be shown that this definition is equivalent to the standard dot product

|v||w| Cos(β)

where β is the angle between the vectors. Thus the inner product in H generalizes the notion of angle between vectors. With the inner product defined, H becomes a *Hilbert space* called *Euclidean N-space*. Note that the inner product <v, v> is the (squared) Euclidean norm of v.

**Orthonormal Basis**: Two vectors are *orthogonal* in H, written v ⊥ w, whenever <v, w> = 0. Orthogonality generalizes the 2-dimensional notion of perpendicularity. If two vectors are orthogonal they are independent, but independent vectors need not be orthogonal. An *orthonormal (ON) basis* for H is a set of N mutually orthogonal unit vectors. Thus the set of vectors Ω = {u_{1}, ..., u_{N}} is an ON basis for H if and only if <u_{k}, u_{j}> = 1 when k = j and zero otherwise. A basic result in Hilbert space (used below) is the *Schwarz inequality* (see [2, §64] for a proof): for any two vectors v and w,

|<v, w>| ≤ |v||w|.

(Note the double use of "|" as absolute value on the left, and vector norm on the right.)

**Fourier Series**: If v is a vector in H and Ω is an ON basis for H, then the orthonormal property of Ω implies the basis expansion of v can be written

v = <v, u_{1}> u_{1} + <v, u_{2}> u_{2} + ... + <v, u_{N}> u_{N}, or

v = F_{1}u_{1} + F_{2}u_{2} + ... + F_{N}u_{N}.

This is called the *Fourier
series* representation of v and the inner products F_{n}
= <v, u_{n}> are the *Fourier coefficients* of v relative to
Ω.

**Example**: The Fourier series for z is

(4) z = <z, u_{1}> u_{1} + <z, u_{2}>
u_{2} + ... + <z, u_{N}> u_{N}
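
Computationally, if the ON basis vectors are stacked as the columns of a matrix U, all N Fourier coefficients are obtained in a single matrix product. A small sketch, using numpy's QR factorization as a stand-in for the Gram-Schmidt orthonormalization described later, applied to the sampled cosines of equation (7):

```python
# Sketch: Fourier coefficients F_n = <z, u_n> relative to an ON basis.
import numpy as np

N = 81
k = np.arange(N)
# Sampled cosines as in equation (7); QR orthonormalizes them column by column.
V = np.stack([np.cos(n * np.pi * k / N) for n in range(N)], axis=1)
U, _ = np.linalg.qr(V)             # columns of U are an ON basis for H

z = np.random.default_rng(0).normal(size=N)   # any data vector
F = U.T @ z                        # F[n] = <z, u_n>, all coefficients at once
assert np.allclose(U @ F, z)       # the Fourier series reconstructs z exactly
```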

**Example**: Assume that x is well-approximated by a linear combination of the ON vectors {u_{1}, ..., u_{M}}, where M < N:

(5)
x ≈ F_{1}u_{1} + F_{2}u_{2} + ... + F_{M}u_{M}

where F_{j} = <x, u_{j}> are the (unknown) Fourier coefficients of the process x.

**Orthogonal Projection**: If Ω is an ON basis for H, and G is the subspace generated by a proper subset of Ω, say Ω′ = {u_{1}, ..., u_{M}} where M < N, then

v* = <v, u_{1}> u_{1} + <v, u_{2}> u_{2} + ... + <v, u_{M}> u_{M}

is called the *orthogonal projection* of v onto G. Since H uses the Euclidean norm, v* is the *least-squares estimate* of v using vectors in G. The subspace generated by the remaining members of Ω, namely {u_{M+1}, ..., u_{N}}, is called the *orthogonal complement* of G and written G^{C}. If v ∈ G and w ∈ G^{C}, then clearly v ⊥ w. Any vector v in H can be written as the *direct sum* of two orthogonal vectors, v = u + w, where u ∈ G and w ∈ G^{C}.

**Example**: Equation (3) tacitly assumes that the noise y is orthogonal to x, so that z is a direct sum of x and y. Using approximation (5) we have

x ≈ <z, u_{1}> u_{1} + <z, u_{2}> u_{2} + ... + <z, u_{M}> u_{M}

y ≈ <z, u_{M+1}> u_{M+1} + <z, u_{M+2}> u_{M+2} + ... + <z, u_{N}> u_{N}

**Least Squares:** Equation (4) and approximation (5) may be
put in matrix form:

z ≈ Af + y,

where A is an N x M matrix with entries

a_{kj} = u_{j}(k),
k = 1,…,N, j=1,…,M

and f is an M-dimensional vector of the unknown
parameters f_{j}. We seek the estimate f* of f that minimizes

| z – Af |^{2}

It can be shown that the solution f* is the least-squares estimate of f, given by

f* = (A'A)^{-1}A'z

where the prime ' denotes matrix transpose (see [3], Section 4.3). Hence the least-squares estimate of x given z is

x* = Pz

where

P = A(A'A)^{-1}A'

is an N x N matrix. This shows that the least-squares estimate of x given z is a linear transformation acting on z, one which depends only on the vectors {u_{j}}. It is easily seen that P has the property P^{2} = P, making it a *projection* matrix; since P is also symmetric, it is in fact an orthogonal projection. The expression for P contains a matrix inversion, which is computationally costly. Using Fourier series expansions, however, allows us to calculate P directly and to prove that it is the projection onto the subspace G. This is done in the next section.
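
As a sketch under the running example's assumptions (the first five sampled cosine modeling functions), P can be formed from these formulas and its projection property checked numerically:

```python
# Sketch: least-squares smoothing x* = Pz with P = A (A'A)^{-1} A'.
import numpy as np

N, M = 81, 5
k = np.arange(N)
# Columns of A are the M sampled modeling functions (cosines, as in (2)).
A = np.stack([np.cos(n * np.pi * k / N) for n in range(M)], axis=1)

# np.linalg.solve stands in for the explicit inverse of A'A.
P = A @ np.linalg.solve(A.T @ A, A.T)
assert np.allclose(P @ P, P)       # idempotent: P^2 = P
assert np.allclose(P, P.T)         # symmetric, hence an orthogonal projection

rng = np.random.default_rng(0)
z = np.exp(-np.linspace(-2, 2, N)**2) + 0.3 * rng.normal(size=N)
x_star = P @ z                     # the smoothed estimate of the process
```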

**PROJECTION MATRIX**

The following theorem calculates the projection matrix P.

**Theorem 1**. Let H be the Hilbert space of N-tuples of real numbers and let Ω = {u_{1}, ..., u_{N}} be an ON basis for H. Let G be the subspace of H generated by the subset Ω′ = {u_{1}, ..., u_{M}}, where M < N. Let P be the symmetric, N x N matrix with components

(6)
P_{k}(m) = Σ_{j=1}^{M} u_{j}(k) u_{j}(m)

where k, m = 1, 2,…, N. Then, for any vector v in H, Pv is the orthogonal projection of v onto G.

**Proof**: Write v as

v
= <v, u_{1}> u_{1} + <v, u_{2}> u_{2}
+ ... + <v, u_{N}> u_{N}.

It must be shown that Pv is equal to

v* = <v, u_{1}> u_{1} + <v, u_{2}>
u_{2} + ... + <v, u_{M}> u_{M}.

Choose a fixed row P_{k}
of P (equivalently, a coordinate v*(k) of v*). Regarding P_{k} as
a row vector and v as a column vector, matrix multiplication of P_{k}
with v can be written as an inner product, and gives the k-th coordinate of v*:

<P_{k}, v> = Σ_{n=1}^{N} P_{k}(n) v(n) = Σ_{n=1}^{N} Σ_{j=1}^{M} u_{j}(k) u_{j}(n) v(n)

= Σ_{j=1}^{M} u_{j}(k) Σ_{n=1}^{N} u_{j}(n) v(n) = Σ_{j=1}^{M} u_{j}(k) <u_{j}, v> = v*(k)

Since row k was arbitrary, this is true for all rows of P and so Pv = v*. ▌
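
Numerically, equation (6) says that P is the sum of the outer products u_j u_j' over the M basis vectors of G, which avoids the matrix inversion of the normal-equation form. A quick sketch comparing the two constructions:

```python
# Sketch of equation (6): P as a sum of outer products of ON basis vectors.
import numpy as np

N, M = 81, 5
k = np.arange(N)
V = np.stack([np.cos(n * np.pi * k / N) for n in range(M)], axis=1)
U, _ = np.linalg.qr(V)             # orthonormalize the M model vectors

P = sum(np.outer(U[:, j], U[:, j]) for j in range(M))  # P_k(m) = sum_j u_j(k) u_j(m)
P_ls = V @ np.linalg.solve(V.T @ V, V.T)               # normal-equation form
assert np.allclose(P, P_ls)        # both give the same orthogonal projection
```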

**Gram-Schmidt**: In view of Theorem 1 it is essential to have an ON basis for H that also serves as a process model. Fortunately, for every set of N independent vectors there is an equivalent set that is orthonormal. Let {v_{1}, v_{2},…, v_{N}} be a set of N independent vectors in H. Then a new set {u_{1}, u_{2},…, u_{N}} that is an ON basis for H may be generated by the *Gram-Schmidt* procedure. First, normalize v_{1} to obtain u_{1}:

u_{1} = v_{1} / |v_{1}|.

Clearly, u_{1} and v_{1} generate the same subspace. Next, u_{2} is obtained from v_{2} by subtracting from v_{2} its component along u_{1}, and then normalizing the result:

w_{2} = v_{2} - <v_{2}, u_{1}> u_{1}

u_{2} = w_{2} / |w_{2}|.

From <w_{2}, u_{1}> = 0, it follows that u_{1} ⊥ u_{2}. Furthermore, u_{1} and u_{2} generate the same subspace as v_{1} and v_{2}. Continuing by induction,

w_{n} = v_{n} - Σ_{k=1}^{n-1} <v_{n}, u_{k}> u_{k}

and

u_{n} = w_{n} / |w_{n}|.

for each n ≤ N. By the same reasoning, the sets {u_{1}, ..., u_{M}} and {v_{1}, v_{2},…, v_{M}} generate the same subspace for each choice of M ≤ N, and u_{k} ⊥ u_{j} whenever j ≠ k.
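
A direct transcription of the procedure follows; this is a sketch of the classical form described above, which can lose orthogonality in floating point for large or ill-conditioned sets (a QR factorization is the usual remedy):

```python
# Sketch: classical Gram-Schmidt orthonormalization.
import numpy as np

def gram_schmidt(vectors):
    """Orthonormalize a list of linearly independent vectors v_1, ..., v_n."""
    basis = []
    for v in vectors:
        w = v.astype(float)
        for u in basis:
            w = w - np.dot(v, u) * u          # subtract <v_n, u_k> u_k
        basis.append(w / np.linalg.norm(w))   # u_n = w_n / |w_n|
    return basis

# Example: the first five sampled cosines of equation (7), below.
N = 81
k = np.arange(N)
us = gram_schmidt([np.cos(n * np.pi * k / N) for n in range(5)])
for i in range(5):
    for j in range(i):
        assert abs(np.dot(us[i], us[j])) < 1e-10  # u_i ⊥ u_j for i ≠ j
```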

**Example**: To smooth the data in Figure 1, the Gram-Schmidt procedure was applied to the modeling functions

{Cos((n-1)πt), n=1,2,…,81}

These functions were first sampled to produce the set of vectors

(7)
{v_{n}(k) = Cos((n-1)πk/81), n = 1, 2,…, 81},

where k = 0, 1, …, 80, then orthonormalized to get the basis vectors {u_{n}}, the first six of which are shown in Figure 4. Note that u_{1}(k) = 1/9 for every k. Roughly speaking, it is u_{1} that captures the mean value of the data, while u_{2} models linear trend. The other vectors help model varying degrees of curvature in the process.

Figure 4. Graphs of the first six orthonormal basis vectors generated by the basis vectors in Equation (7).

**Data-Smoothing Example**: The process vector x shown in Figure 1 was calculated as 81 points of a uniform sampling of exp(-t^{2}). The smoothed estimates x* = Pz in Figure 2 were obtained with the projection matrix P generated by the first five of these basis vectors.

**Row Norms**: For each row P_{k} of P, the computation of x*(k) as <P_{k}, z> is a weighted average of the coordinates of z. Figure 5 is a plot of five of these rows, for the P matrix used to obtain the smoothed estimates in Figure 2. An animation of all 81 rows would show a traveling wave, beginning high on the left, near 0.4, dropping quickly as it moves to the right with a peak height around 0.2, and finally building up high again at the right end. The symmetry of these curves is due to the fact that P itself is symmetric by definition and that P^{2} = P.

Figure 5. Rows 1, 21, 41, 61, 81 of the P matrix used to obtain the estimates in Figure 2.

The rows P_{k} of P, regarded as vectors in H, enjoy another property, namely:

**Theorem 2**: The square roots of the diagonal entries of P are the norms of their respective rows, i.e., |P_{k}| = √P_{k}(k) for each k.

**Proof**:

|P_{k}|^{2} = <P_{k}, P_{k}> = Σ_{j=1}^{N} P_{k}(j) P_{k}(j) = Σ_{j=1}^{N} P_{k}(j) Σ_{n=1}^{M} u_{n}(k) u_{n}(j)

= Σ_{n=1}^{M} u_{n}(k) Σ_{j=1}^{N} P_{k}(j) u_{n}(j) = Σ_{n=1}^{M} u_{n}(k) u_{n}(k) = P_{k}(k),

where the second-to-last equality uses <P_{k}, u_{n}> = u_{n}(k), which follows from Theorem 1 because u_{n} ∈ G. ▌
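
A quick numerical check of Theorem 2 for the running example (first five orthonormalized cosines, N = 81):

```python
# Check: the norm of row k of P equals the square root of P_k(k).
import numpy as np

N, M = 81, 5
k = np.arange(N)
U, _ = np.linalg.qr(np.stack([np.cos(n * np.pi * k / N) for n in range(M)], axis=1))
P = U @ U.T                        # projection matrix, as in Theorem 1

row_norms = np.linalg.norm(P, axis=1)          # |P_k| for each row
assert np.allclose(row_norms, np.sqrt(np.diag(P)))
```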

**Bias-Variance Tradeoff**: The squared norms |P_{k}|^{2} are important because they represent an upper bound on the smoothing capability of their respective rows. Define the *estimation error* due to P by

x* - x = Pz – x = Px + Py – x = b + Py,

where the *bias vector* is defined by

b = Px – x,

and Py is the *residuals vector*. The
standard *bias-variance decomposition* of the squared-error is

|x* - x|^{2} = |b + Py|^{2} = |b|^{2} + |Py|^{2} + 2<b, Py>.

The quantity |Py|^{2} is called the *residual variance* due to P; it is the amount of noise remaining in the estimate. The quantity |b|^{2} is called the *bias* due to P and indicates any lack of fit by the modeling functions. The last term represents any correlation between the bias and the residuals, and is usually assumed to be zero. Notice that if y is truly orthogonal to G then the residual variance is zero, and if x is truly a member of G then P is unbiased and b = 0.
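
The decomposition can be checked numerically for the simulated example; the t-interval and noise construction repeat the assumptions of the earlier data-generation sketch:

```python
# Sketch: bias vector b = Px - x and residuals vector Py for the example.
import numpy as np

N, M = 81, 5
k = np.arange(N)
U, _ = np.linalg.qr(np.stack([np.cos(n * np.pi * k / N) for n in range(M)], axis=1))
P = U @ U.T

t = np.linspace(-2.0, 2.0, N)      # assumed sampling interval, as before
x = np.exp(-t**2)
y = np.random.default_rng(0).uniform(-np.sqrt(3), np.sqrt(3), size=N)

b = P @ x - x                      # bias due to P
Py = P @ y                         # residuals left in the estimate
err = P @ (x + y) - x              # estimation error x* - x
assert np.allclose(err, b + Py)    # the error decomposes exactly into b + Py
print(b @ b, Py @ Py, 2 * (b @ Py))   # |b|^2, |Py|^2 and the cross term
```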

There is a *bias-variance
tradeoff* inherent in this and similar procedures: as model complexity
increases, the bias decreases while the variance increases (see reference [1]
for more on this). In the current procedure, model complexity is
indicated by the dimension of the subspace G. More modeling functions in
G means less bias due to P at the cost of increased residual variance.

**Noise-Ratio**: Define a vector R_{G}, called the *noise-ratio* due to G, coordinate-wise by

R_{G}(k) = (x*(k) - x(k))^{2} / |y|^{2}.

R_{G}(k) represents the fraction of noise left in the
estimate by the kth row of P. Its dependence on the number, the length
and nature of the basis vectors is indicated by the subscript G. Using
the bias-variance decomposition coordinate-wise we have

(x*(k) - x(k))^{2} = (b(k) + <P_{k}, y>)^{2} = b(k)^{2} + <P_{k}, y>^{2} ≤ b(k)^{2} + |P_{k}|^{2} |y|^{2}

where the Schwarz inequality was used to obtain the inequality, and the residuals and bias vector are assumed uncorrelated. Thus, in the unbiased case, the squared norms |P_{k}|^{2} are upper bounds for the noise-ratios:

R_{G}(k) ≤ |P_{k}|^{2}

for each k.
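
Since |P_{k}|^{2} = P_{k}(k) by Theorem 2, curves like those in Figure 6 are just the diagonals of the projection matrices. A sketch for subspace dimensions 1 through 6:

```python
# Sketch: upper bounds |P_k|^2 = P_k(k) for subspace dimensions 1..6.
import numpy as np

N = 81
k = np.arange(N)
U, _ = np.linalg.qr(np.stack([np.cos(n * np.pi * k / N) for n in range(N)], axis=1))

for M in range(1, 7):
    UM = U[:, :M]                      # ON basis of an M-dimensional G
    bounds = (UM**2).sum(axis=1)       # diagonal of P = UM UM', by Theorem 2
    print(M, bounds[0], bounds[N // 2])  # edge value vs. midpoint value
```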

Figure 6. Upper bounds for noise-ratios vs. row number for rows of P generated by the basis vectors in Equation (7).

Figure 6 shows plots of |P_{k}|^{2} versus k for the basis vectors in equation (7), and subspace dimensions 1 through 6. The first curve, labeled Dim 1, corresponds to a subspace G of dimension 1 generated by u_{1}. The second curve, labeled Dim 2, corresponds to a G of dimension 2 generated by u_{1} and u_{2}, with similar correspondences for the other curves. Notice that, for fixed k, |P_{k}|^{2} increases with the dimension of G, and that, except in the Dim 1 case, it is largest near the edges. These curves look the same for different values of N; only the scale changes, decreasing uniformly as N increases. At the midpoint, the value for Dim 1 equals that for Dim 2; similarly the value for Dim 3 equals that for Dim 4, and so on for higher dimensions. Figure 7 shows upper bounds for midpoint noise-ratios versus the number of data points, for the basis vectors in equation (7) and subspaces G of odd dimension 1 through 9.

Figure 7. Upper bounds for midpoint noise-ratios vs. number of data points, for subspaces of dimensions 1, 3, 5, 7 and 9 generated by the basis vectors in Equation (7).

**Fourier Coefficients**: Figures 1 and 2 provide visual evidence for
the ability of the projection matrix to accurately extract the underlying
process from simulated noisy data. Another way to see this effect is to
look at the Fourier coefficients of x, y and z. Figure 8 shows the
Fourier coefficients F_{n} for the data in
Figure 1, the estimate in Figure 2 and the noise in Figure 3, relative to the
basis vectors in equation (7). Numbers on the x-axis are values of n
called the *coefficient number*. The effectiveness of this procedure
can be seen in the clean separation of the data coefficients into the
process and the noise coefficients. The process retains the first five coefficients,
corresponding to the five ON basis vectors used in the process model.

Figure 8a. Fourier coefficients of the data in Figure 1

Figure 8b. Fourier coefficients of the process estimate in Figure 2

Figure 8c. Fourier coefficients of the noise estimate in Figure 3

**Subspace Dimension**: We have so far avoided the question of which dimension to choose. In the simulated case above we have the truth, as seen in Figure 1, and so can judge that any dimension less than 5 produces a biased estimate. The situation is different in real-world applications, where the truth is never known. Selecting the dimension is equivalent to choosing the kernel length in kernel smoothing ([1], Chapter 6), the size of the multiplier of the penalty term in penalized-fit methods ([1], Chapter 7), and the frequency cutoff in low-pass digital filtering. A low-pass filter is a linear device that removes frequencies above the cutoff frequency, which is exactly what the matrix P did, as seen in Figure 8. Each coefficient F_{n} corresponds to a single, cyclic data-component of fixed period L. If n > 1, then the period of the corresponding component is given by

L = 2N/(n-1)

where N is the number of points. For example, with N = 81 the coefficient at n = 5 corresponds to a component with a period of 2(81)/4 = 40.5 data points. In this memo our choice of subspace dimension is guided by the Fourier coefficients, as seen in the following example.
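
For instance, the periods at a few coefficient numbers, for N = 81:

```python
# Periods L = 2N/(n-1) of the cyclic components, for N = 81.
N = 81
for n in (2, 3, 5, 8):
    print(n, 2 * N / (n - 1))      # 162.0, 81.0, 40.5, about 23.1 points
```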

**AN EXAMPLE FROM THE STOCK MARKET**

Closing prices for stocks and other market items are reported at the end of each trading day, along with the number of shares traded, known as the item's volume. The product of these two numbers is called the *price-volume*. Figure 9 shows price-volume (x10^{-8}) for Microsoft (MSFT) ending 10/06/06; data points are indicated by circles and connected by straight lines.

Figure 9. Price-volume
(x10^{-8}) for Microsoft ending 10/06/06.

Some market analysts search for cyclic patterns in this type of data to help determine market direction. Figure 10 shows Fourier coefficients for the data, which has been adjusted to have mean zero. Removing the mean is equivalent to setting the first coefficient to zero, which emphasizes the other values.

Figure 10. Fourier coefficients for the data in Figure 9 after mean deletion.

There are dominant values at n=3, corresponding to a slowly varying component, and at n=8, the largest value. In addition there are some "natural gaps" where coefficients have small values; these gaps suggest the subspace dimensions used for the smoothed estimates in Figure 11.

Figure 11. Data and smoothed estimates using basis vectors in equation (7), and dimensions 4, 9, 12 and 25.

The dominant component at n=8 can be isolated by subtraction and is shown in Figure 12. Its period is seen to be about 23 days, which corresponds to the time between the major peaks in Figure 9. The analyst can use any or all of the smoothings or isolated components for predicting market direction.
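
A sketch of this isolation step: the component at coefficient n is F_{n}u_{n} = <z, u_{n}>u_{n}, which is also the difference between the smoothings of dimensions n and n-1. The series below is a random placeholder, since the MSFT price-volume data is not reproduced here, and the length N = 81 is an assumption:

```python
# Sketch: isolating the single cyclic component at coefficient n = 8.
import numpy as np

N = 81                              # assumed length; the memo does not state it
k = np.arange(N)
U, _ = np.linalg.qr(np.stack([np.cos(m * np.pi * k / N) for m in range(N)], axis=1))

pv = np.random.default_rng(2).normal(size=N)  # placeholder for the price-volume data
pv = pv - pv.mean()                 # mean deletion, as in Figure 10

n = 8
u = U[:, n - 1]                     # coefficient number n is the n-th basis vector
component = (pv @ u) * u            # F_n u_n, the isolated component

# Equivalently, the difference of two adjacent-dimension smoothings:
P8 = U[:, :8] @ U[:, :8].T
P7 = U[:, :7] @ U[:, :7].T
assert np.allclose(component, P8 @ pv - P7 @ pv)
```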

Figure 12. Data component corresponding to coefficient n=8.

**REFERENCES**

1. Hastie, T., Tibshirani, R., and Friedman, J., *The Elements of Statistical Learning*, Springer, 2001.

2. Halmos, P., *Finite Dimensional Vector Spaces*, Van Nostrand, 1993.

3. Luenberger, D.G., *Optimization by Vector Space Methods*, Wiley, 1969.