** Data-Smoothing By Vector Space Methods**

by Rick Martinelli, Haiku Laboratories June 2003

Copyright ©, 2003

Updated November 2006

Introduction

Vector Spaces and Least Squares

Projection Matrix

An Example From The Stock Market

References

**INTRODUCTION**

Suppose a time-varying process x(t) is measured at regular intervals, and it is known
that the measurements are contaminated with noise. If we let z(k) represent the measurement at the
kth interval, the situation may be
represented by

(1) z(k) =
x(k)_{ }+ y(k) k = 1,2,...,N,

where {x(k)} is called the *process*, {z(k)} is called the *data*,
{y(k)} are samples from a zero-mean random sequence with fixed variance σ^{2},
and N is the number of measurements. The *data‑smoothing problem* is to estimate the process {x(k)} from the data
{z(k)}. This is a centuries old problem,
first addressed by the likes of Gauss and Legendre who formulated the first least-squares
estimates.

Figure 1 shows a set of simulated measurements (the data) as
green circles, together with the underlying process (blue line). The process is a uniform sampling of N=81
points of the exponential function exp(‑t^{2}). The noise sequence is machine-generated
random numbers from a uniform distribution with mean 0 and variance 1. The noise was treated to remove low frequency
components as described below.

Figure 1. Simulated data (green circles) and the underlying process. Data points are connected by dashed lines.

The problem is to retrieve a good approximation of the process from the noisy data. To solve the problem, additional information about the nature of the process and/or the noise is required. The usual approach is to impose some sort of smoothness condition on x, either by way of a process model, as in the case of least-squares estimates, or frequency restrictions as used in signal processing. More sophisticated approaches include a smoothness parameter that determines the trade-off between fidelity to a model and minimization of the error variance (see [1] for comparisons between approaches.).

The technique described here is a modified basis-expansion
method (see [1]) which combines features from both the least-squares and signal
processing approaches. Generic modeling
functions are chosen so that the
process contains a small number of the slowly varying (low frequency)
components of the data, including the mean, and the noise contains all other
components. Then the least-squares
estimate of the process is calculated by applying a linear transformation, in
the form of a *projection **matrix,* to the data. This results in a very fast algorithm that
may be used in real-time applications.

Figure 2. Smoothed estimates (circles) and the underlying process.

Figure 2 shows the results of this procedure applied to the data in Figure 1. Smoothed estimates are shown as individual circles; the process is shown as a solid line. The modeling functions used here are

(2) {1, Cos(πt), Cos(1πt), Cos(2πt), Cos(3πt), Cos(4πt)}

The noise is found by subtracting the estimates from the data and is shown in Figure 3.

Figure 3. Noise estimate: Data minus smoothed estimate.

In the next section we provide an outline of the aspects of vector spaces and the least-squares method needed to develop the projection matrix. In section 3, the projection matrix is calculated and some of its properties described. In the last section, the method is used to smooth real stock-market data to extract its dominant frequency component. The vector space techniques used here may be found in references [2] and [3]; there is nothing new here except possibly the explicit calculation of the projection matrix from the modeling functions in Theorems 1 and 2.

The following is an
outline of the machinery required to calculate the projection matrix. An *N-dimensional vector space* H is a set of
N-tuples

v = (v(1), v(2),…, v(N))

where the v(k) are real
numbers called the *coordinates* of
v. The zero vector in H is (0, 0,…, 0).

All vector spaces considered in this memo will be N-dimensional vector spaces for some fixed N. Such spaces are generalizations of the 2-dimensional plane where vectors are represented by arrows.

** Example**. Equation (1) in vector notation is

(3) z
= x_{ }+ y.

Let S = {v_{1}, v_{2},…, v_{n}}
be a set of vectors in H. A *linear
combination *of the vectors in S is a sum of the form

a_{1}v_{1} + a_{2}v_{2} +…+ a_{n}v_{n}

where
the a_{k} are real constants. The set of all such
linear combinations is called the *subspace generated by S* and is denote
by [S].

**Independence**: A vector
in H is *linearly dependent* on S if it is a member of [S], and *linearly
independent* of S otherwise. The set
S is called *linearly independent* if each of its members is linearly
independent of the remaining members. A
linearly independent set S is a *basis* for [S].

** Example**. The set S of polynomials

S = {V_{j}(k) = (k/N)_{ }^{j },
k =1,…,N}

where j=0,…,N-1 is a basis for H.

** Example**. The modeling functions in expression (2) above may be sampled to
yield the set

S = {1, Cos(πkt/N), Cos(2πkt/N), Cos(3πkt/N), Cos(4πkt/M)}

which forms a basis for the 4-dimensional subspace [S] of H.

In particular, if the
set S contains
N linearly independent vectors, then S is a *basis* for H, and every vector in H is a linear combination of the basis vectors,

(3) v
= a_{1}v_{1} + a_{2}v_{2} +…+ a_{N}v_{N}

for
come constants a_{k} .

**Vector Norm**: The notion of “vector length” can be generalized from the
2-dimensional case by defining the (squared) *norm* of a vector v = (v(1), v(2),…, v(N)) in H as

|v|^{2} = v(1)^{2} + v(2)^{2} + ,…,+
v(N)^{2}

This is called the *Euclidean* norm, and H is called *Euclidean N-space*. A vector of norm one is called a *unit* vector.

**Inner Product**: The notion of dot product
between vectors in two dimensions may be generalized to N dimensions by
defining the *inner product* of two vectors v and w in H as

<v, w> = v(1)w(1) + v(2)w(2) +…+ v(N)w(N).

The inner product is symmetric: <v, w> = <w, v>. When N=2 it may be shown that this definition is equivalent to the standard dot product given by

|v||w| Cosine( β )

where β is the angle between the vectors. Thus, the inner product in H
generalizes the notion of angle between vectors. With the inner product defined, H becomes a *Hilbert space* called *Euclidean N-space*. Note that the inner product <v, v> is the (squared) Euclidean* *norm of v.

Orthonormal Basis : Two vectors are *orthogonal* in H, written v⊥w, whenever <v, w> = 0. Orthogonality generalizes the 2-dimensional notion of perpendicularity. If two vectors are orthogonal they are independent, but independent vectors may not be orthogonal. An *orthonormal (ON) basis* for H is a set of N mutually orthogonal unit vectors. Thus the set of vectors Ω = {u_{1}, ..., u_{N}} is an ON basis for H if and only if <u_{k},u_{j}> = 1 when k = j and is zero otherwise.
A basic result
in Hilbert space (used below) is the *Schwartz
inequality *(see[2, §64] for a proof): for any two vectors v and w,

|<v, w>| ≤ |v||w|.

(Note the double use of "|" as absolute value on the left, and vector norm on the right.)

**Fourier Series**: If v is a vector in H and Ω is an ON basis for H, then the orthonormal property of Ω implies equation (3) can be written

v =
<v, u_{1}> u_{1} +
<v, u_{2}> u_{2} + ...
<v, u_{N}> u_{N}, or

v = F_{1}u_{1} + F_{2}u_{2} + ... + F_{N}u_{N}.

This is called the *Fourier series* representation of v and the inner products
F_{n} = <v, u_{n}> are the *Fourier
coefficients* of v relative to Ω.

** Example**: The Fourier series for z is

(4) z = <z, u_{1}> u_{1} +
<z, u_{2}> u_{2} + ...
<z, u_{N}> u_{N}

** Example**: Assume that x is well-approximated by a
linear combination of ON vectors {u

(5) x ≈ F_{1}u_{1} + F_{2}u_{2} + ... + F_{M}u_{M}

where F_{j} = <x, u_{j}> are the
(unknown) Fourier coefficients of the process x.

**Orthogonal Projection**: If Ω is an ON basis for H, and G is the subspace generated by a proper subset of Ω, say Ω’ = {u_{1}, ..., u_{M}} from H, where M
< N, then

v* = <v, u_{1}> u_{1} +
<v, u_{2}> u_{2} + ...
<v, u_{M}> u_{M}

is called the *orthogonal projection* of v onto G. Since H uses the Euclidean norm, v*
corresponds to the *least-squares estimate* of v using vectors in G. The subspace generated by the remaining members of Ω, namely {u_{M+1}, ..., u_{N}} is called the *orthogonal
complement* of G and written G^{C}. If v ∈ G and w ∈ G^{C}, then clearly v ⊥ w. Any vector v in H can be written as the *direct sum* of two orthogonal vectors v = u_{ }+
w, where u ∈ G and w ∈ G^{C}.

** Example**: Equation (3) tacitly assumes that the noise y
is orthogonal to x, and so z is a direct sum of x and y. Using approximation (5) we have

x = <z, u_{1}> u_{1} +
<z, u_{2}> u_{2} + ...
<z, u_{M}> u_{M}

y = <z, u_{M+1}> u_{M+1} +
<z, u_{M+2}> u_{M+2} + ...
+ <z, u_{N}> u_{N}

**Least Squares:** Equation (4) and approximation (5) may be put
in matrix form:

z ≈ Af + y,

where A is an N x M matrix with entries

a_{kj} = u_{j}(k), k = 1,…,N, j=1,…,M

and f is an M-dimensional vector
of the unknown parameters f_{j}. We seek the estimate f* of f that
minimizes

| z – Af |^{2}

It can be shown that the solution f* is the least-squares estimate of f given by

f* = (A'A)^{-1 }A'z

where the prime ' denotes matrix transpose (See [3], section 4.3). Hence the least-squares estimate of x given z is

x* = Pz

where

P
= A(A'A)^{-1}A'

is an N x N matrix. This shows that the least-squares estimate of
x given z is a linear transformation acting on z, and which depends only on the
vectors {u_{j}}. It is easily seen that matrix P has the property P^{2} = P,
making it a *projection* matrix. P
is defined below as symmetric making it in fact an orthogonal projection. The expression for P contains a matrix
inversion which is computationally costly. Using Fourier series expansions however, allows us to easily calculate P
and prove it is the projection onto subspace G. This is done in the next section.

The following theorem calculates the projection matrix P.

*Theorem 1***.** Let
H be the Hilbert space of N-tuples of real numbers and let Ω = {u_{1}, ..., u_{N}} be an ON basis
for H. Let G be the subspace of H
generated by the subset Ω = {u_{1}, ..., u_{M}, M<N}. Let P be the symmetric, N x N matrix
with components

(6) P_{k}(m)
= _{} u_{j}(k) u_{j}(m)

where k,m
= 1,2,…,N and M < N. Then, for any vector v in H, Pv is the orthogonal projection of v onto G.

** Proof**: Write v as

v = <v, u_{1}> u_{1} +
<v, u_{2}> u_{2} + ...
+ <v, u_{N}> u_{N}.

It must be shown that Pv is equal to

v* = <v, u_{1}> u_{1} +
<v, u_{2}> u_{2} + ... +
<v, u_{M}> u_{M}.

Choose a fixed row P_{k} of P (equivalently, a coordinate v*(k)
of v*). Regarding P_{k} as a row
vector and v as a column vector, matrix multiplication of P_{k} with v
can be written as an inner product, and gives the k-th coordinate of v*:

<P_{k},
v> = _{} P_{k}(n) v(n) = _{}_{} u_{j}(k) u_{j}(n) v(n)

= _{}u_{j}(k) _{}u_{j}(n) v(n) = _{}u_{j}(k) <u_{j},v> = v*(k)

Since row k was arbitrary, this is true for all rows of P and so Pv = v*. ▌

**Gram-Schmidt**: In view of Theorem 1 it is essential to have
an ON basis for H that also serves as process model. Fortunately, for every set of N independent
vectors there is and equivalent set that is orthonormal. Let {v_{1}, v_{2},…, v_{N}} be
a set of N independent vectors in H. Then a new set {u_{1}, u_{2}, ,…, u_{N}} may be generated that is
an ON basis for H by following the *Gram-Schmidt* procedure. First, normalize v_{1} to obtain u_{1}:

u_{1} = v_{1} / |v_{1}|.

Clearly, u_{1} and v_{1} generate the same subspace. Next, u_{2} is obtained from v_{2} by subtracting any contribution to v_{2} by u_{1}, and then
normalizing the result again:

w_{2} = v_{2} - <v_{2}, u_{1}> u_{1}

u_{2} = w_{2} / |w_{2}|.

From <w_{2}, u_{1}> = 0 , it follows that u_{1} ⊥ u_{2}. Furthermore u_{1} and u_{1} generate the
same subspace as v_{1} and v_{2}. Continuing by induction

w_{n} = v_{n} - _{}<v_{n}, u_{k}> u_{k}

and

u_{n} = w_{n} / |w_{n}|.

for
each n ≤ N. By the same reasoning, the sets {u_{1}, ..., u_{M}} and {v_{1},
v_{2},…, v_{M}} generate the same subspace for each choice of M ≤ N, and u_{k} ⊥ u_{j} whenever j ≠ k.

** Example**: To smooth
the data in Figure 1, the Gram-Schmidt procedure was applied to modeling
functions

{Cos((n-1)πt), n=1,2,…,81}

These functions were first sampled to produce the set of vectors

(7) {v_{n}(k)
= Cos((n-1)πk/81), n=1,2,…,81},

where k = 0, 1, …, 80, then orthonormalized to get the
basis vectors {u_{k}}. The first six of these are shown as curves in
Figure 4. (For better visibility, individual
points have been connected to produce continuous color-coded curves.) The
constant vector is u_{1} = 1/9. Roughly speaking, it is u_{1} that
captures the mean value of the data, while u_{2} models linear trend. The other vectors help model varying degrees
of curvature in the process.

Figure 4. Graphs of the first six orthonormal basis vectors generated by the basis vectors in Equation (7).

** Data-Smoothing
Example**: The process vector x shown in
Figure 1 was calculated as 81 points of a uniform sampling of exp(-t

**Row Norms**: For each row P_{k} of P, the computation
of x_{k} as <P_{k}, z> is a weighted average of the
coordinates of z. Figure 5 is a plot of
five of these rows, for the P matrix used to obtain the smoothed estimates in
Figure 2. An animation of all 81
functions would show a traveling wave, beginning high on the left, near 0.4,
dropping quickly as it moves to the right with a peak height around 0.2, and
finally building up high at the right end. The symmetry of these curves is due to the fact that P itself is
symmetric by definition and that P^{2} = P.

Figure 5. Rows 1, 21, 41, 61, 81 of the P matrix used to obtain the estimates in Figure 2.

The rows P_{k} of P, regarded as vectors in H, enjoy another
property, namely:

** Theorem 2**: The square
roots of the diagonal entries in P are the norms of their respective rows,
i.e., |P

** Proof**:

|P_{k}|^{2} =<P_{k}, P_{k}> = _{} P_{k}(j) P_{k} (j) = _{} P_{k}(j)_{} u_{n}(k) u_{n}(j)

= _{} u_{n}(k)_{} P_{k}(j) u_{n}(j) = _{} u_{n}(k) P_{k}(n) = P_{k}(k) ▌.

**Bias-Variance
Tradeoff**: The squared-norms |P_{k}|^{2} are important because they represent an upper bound
on the smoothing capability of their respective rows. Define the *estimation* *error *due
to P by

x* - x = Pz – x = Px + Py – x = b + Py,

where the *bias vector* is defined by

b = Px – x,

and Py is the *residuals vector*. The standard *bias-variance decomposition* of the squared-error is

|x* - x|^{2} = |b + Py|^{2} = |b|^{2} + |Py|^{2} +
2|<b, Py>|.

The quantity |Py|^{2} is called *residual variance* due to P; it is
the amount of noise remaining in the estimate. The quantity |b|^{2} is
called the *bias* due to P and
indicates any lack of fit by the modeling functions. The
last term represents any correlation between the bias and residual variance,
and is usually assumed to be zero. Notice that, if y is truly orthogonal to G then residual variance is
zero, and if x is truly a member of G then P is unbiased and b = 0.

There is a *bias-variance tradeoff* inherent
in this and similar procedures: as model complexity increases, the bias
decreases while the variance increases (see
reference [1] for more on this). In the
current procedure, model complexity is indicated by the dimension of the
subspace G. More modeling functions in G
means less bias due to P at the cost of increased residual variance.

**Noise-Ratio**: Define a vector R_{G}, called the *noise-ratio* due to G, coordinate-wise by

R_{G}(k) = (x(k)* - x(k))^{2}/ |y|^{2}.

R_{G}(k) represents the fraction of noise left in the estimate
by the kth row of P. Its dependence on
the number, the length and nature of the basis vectors is indicated by the
subscript G. Using the bias-variance
decomposition coordinate-wise we have

(x(k)* - x(k))^{2} = (b(k) + <P_{k},y>)^{2} = b(k)^{2} + <P_{k},y>^{2} ≤ b(k)^{2} + |P_{k}|^{2} |y|^{2}

where Schwartz was used to obtain the inequality, and the residuals and
bias vector are assumed uncorrelated. Thus,
in the unbiased case, the squared-norms |P_{k}|^{2} are upper bounds for the noise-ratios:

R_{G}(k)
≤ |P_{k}|^{2}

for each k.

Figure 6. Upper bounds for noise-ratios vs. row number for rows of P generated by the basis vectors in Equation (7).

Figure 6 shows plots of |P_{k}|^{2} versus k for the basis vectors in equation (7), and
subspace dimensions 1 through 6. The
first curve, labeled Dim 1, corresponds to a subspace G of dimension 1
generated by u_{1}. The second curve, labeled Dim 2, corresponds to a G
of dimension 2 generated by u_{1} and u_{2}, with similar
correspondences for the other curves. Notice that, for fixed k, |P_{k}|^{2} increases with the dimension of G, and that,
except in the Dim 1 case, is largest near the edges. These curves look the same for different
values of N, only the scale changes, uniformly decreasing as N increases. At the midpoint, the value for Dim 1 equals
that for Dim2; similarly the value for Dim 3 equals that for Dim4, and so on
for higher dimensions. Figure 7 shows Upper bounds for midpoint noise-ratios vs. number
of data points for the basis vectors in equation (7) and for subspaces G of odd dimension 1 through 9.

Figure 7. Upper bounds for midpoint noise-ratios vs. number of data points, for subspaces of dimension 1, 3, 5 ,7 and 9 generated by the basis vectors in Equation (7).

**Fourier Coefficients**: Figures 1 and 2 provide visual evidence for
the ability of the projection matrix to accurately extract the underlying
process from simulated noisy data. Another way to see this effect is to look at the Fourier coefficients of
x, y and z. Figure 8 shows the Fourier
coefficients F_{n} for the data in Figure 1, the estimate in Figure 2
and the noise in Figure 3, relative to the basis vectors in equation (7). Numbers on the x-axis are values of n called
the *coefficient number*. The
effectiveness of this procedure can be seen in the clean separation of the data
coefficients between the process and the noise. The process retains the first five coefficients, corresponding to the
five ON basis vectors used in its model.

Figure 8a. Fourier coefficients of the data in Figure 1

Figure 8b. Fourier coefficients of the process estimate in Figure 2

Figure 8c. Fourier coefficients of the noise estimate in Figure 3

**Subspace Dimension**: We have so
far avoided the question of which dimension to choose. In the simulated case above, we have the
truth, as seen in Figure 1, so can judge that any dimension less than 5
produces a biased estimate. The
situation is different in real-world applications, where the truth is never
known. Selecting the dimension size is equivalent to choosing kernel length in
kernel smoothing ([1], Chapter 6), the size of the multiplier of the penalty
term in penalized fit methods [([1], Chapter 7), and the frequency cutoff in low-pass digital filtering. A low-pass filter is a linear device that removes frequencies above the cutoff
frequency, which is exactly what the matrix P did as seen in Figures 8. Each coefficient F_{n} corresponds to
a single, cyclic data-component of fixed period L. If n > 1, then the period of the
corresponding component is given by

L = 2N/(n-1)

where N is the number of points. For example, the coefficient at n = 3 corresponds to a component with a period of 40.5 data points. In this memo our choice of subspace dimension is guided by the Fourier coefficients, as seen in the following example.

**AN EXAMPLE FROM THE STOCK MARKET**

Closing
prices for stocks and other market items are reported at the end of each
trading day, along with the number of shares traded, known as the item’s
volume. The product of these two numbers, called *price-volume*, is an
^{-8}) for Microsoft (MSFT)
ending 10/06/06; data points are indicated by circles and connected by straight
lines.

Figure 9. Price-volume (x10^{-8}) for
Microsoft ending 10/06/06.

Some market analysts search for cyclic patterns in this type of data to help determine market direction. Figure 10 shows Fourier coefficients for the data, which has been adjusted to have mean zero. Removing the mean is equivalent to setting the first coefficient to zero, which emphasizes the other values.

Figure 10. Fourier coefficients for the data in Figure 9 after mean deletion.

There are dominant values at
n=3, corresponding to a slowly varying component, and at n=8, the largest
value. In addition there are some
“natural gaps”, where coefficients have small values,

Figure 11. Data and smoothed estimates using basis vectors in equation (7), and dimensions 4, 9, 12 and 25.

The dominant component at n=8 can be isolated by subtraction and is shown in Figure 12. Its period is seen to be about 23 days, which corresponds to the time between the major peaks in Figure 9. The analyst can use any or all of the smoothings or isolated components for predicting market direction.

Figure 12. Data component corresponding to coefficient n=8.

1. Hastie, T, Tibshirani, R, and Friedman, J, *The Elements of Statistical Learning*, Springer, 2001.

2. Halmos, P, *Finite Dimensional Vector Spaces*, Van Nostrand, 1993.

3. Luenberger, D.G., *Optimization by Vector Space Methods*, Wiley, 1969.