Data-Smoothing By Vector Space Methods
by Rick Martinelli, Haiku Laboratories June 2003
Copyright © 2003
Updated November 2006

Introduction
Vector Spaces and Least Squares
Projection Matrix
An Example From The Stock Market
References


INTRODUCTION
Suppose a time-varying process x(t) is measured at regular intervals, and it is known that the measurements are contaminated with noise.  If we let z(k) represent the measurement at the kth interval, the situation may be represented by

(1)                                                            z(k) = x(k) + y(k)    k = 1,2,...,N,

where {x(k)} is called the process, {z(k)} is called the data, {y(k)} are samples from a zero-mean random sequence with fixed variance σ², and N is the number of measurements.  The data-smoothing problem is to estimate the process {x(k)} from the data {z(k)}.  This is a centuries-old problem, first addressed by the likes of Gauss and Legendre, who formulated the first least-squares estimates.

Figure 1 shows a set of simulated measurements (the data) as green circles, together with the underlying process (blue line).  The process is a uniform sampling of N = 81 points of the exponential function exp(−t²).  The noise sequence consists of machine-generated random numbers from a uniform distribution with mean 0 and variance 1, treated to remove low-frequency components as described below.

Figure 1.  Simulated data (green circles) and the underlying process.  Data points are connected by dashed lines.

The problem is to retrieve a good approximation of the process from the noisy data.  The usual approach is to impose some sort of smoothness condition on x, either by way of a process model, as in least-squares estimation, or by frequency restrictions, as in signal processing.  More sophisticated approaches include a smoothness parameter that determines the trade-off between fidelity to a model and minimization of the error variance (see [1] for comparisons between approaches).

The technique described here is a modified basis-expansion method (see [1]) that combines features of the least-squares and signal-processing approaches.  Generic modeling functions are chosen so that the process contains a small number of the slowly varying (low-frequency) components of the data, including the mean, and the noise contains all other components.  The least-squares estimate of the process is then calculated by applying a linear transformation, in the form of a projection matrix, to the data.  The result is a very fast algorithm, in effect a lowpass filter, that may be used in real-time applications.

Figure 2.  Smoothed estimates (circles) and the underlying process.

Figure 2 shows the results of this procedure applied to the data in Figure 1.  Smoothed estimates are shown as individual circles; the process is shown as a solid line.  The modeling functions used here are

(2)                                       {1, Cos(πt), Cos(2πt), Cos(3πt), Cos(4πt)}

The noise is found by subtracting the estimates from the data and is shown in Figure 3.

Figure 3.  Noise estimate: Data minus smoothed estimate.

In the next section we provide an outline of the aspects of vector spaces and the least-squares method needed to develop the projection matrix.  In section 3, the projection matrix is calculated and some of its properties described.  In the last section, the method is used to smooth real stock-market data to extract the dominant frequency components. The vector space techniques used here may be found in references [2] and [3]; there is nothing new here except possibly the explicit calculation of the projection matrix from the modeling functions in Theorems 1 and 2.

VECTOR SPACES AND LEAST-SQUARES

The following is an outline of the machinery required to calculate the projection matrix. An N-dimensional vector space H is a set of N-tuples

                                                   v = (v(1), v(2),…, v(N))

where the v(k) are real numbers called the coordinates of v.  The zero vector in H is (0, 0,…, 0).

All vector spaces considered in this memo will be N-dimensional vector spaces for some fixed N.  Such spaces are generalizations of the 2-dimensional plane where vectors are represented by arrows.

Example. Equation (1) in vector notation is

(3)                                                         z = x + y.

Let S = {v1, v2,…, vn} be a set of vectors in H.  A linear combination of the vectors in S is a sum of the form

                                                     a1v1 + a2v2 +…+ anvn

where the ak are real constants.  The set of all such linear combinations is called the subspace generated by S and is denoted by [S].

Independence:  A vector in H is linearly dependent on S if it is a member of [S], and linearly independent of S otherwise.  The set S is called linearly independent if each of its members is linearly independent of the remaining members.  A linearly independent set S is a basis for [S]. 

Example. The set S of N sampled polynomial vectors

                                                   S = {v0, v1,…, vN-1},     vj(k) = (k/N)^j,   k = 1,…,N,

is a basis for H.

Example. The modeling functions in expression (2) above may be sampled at t = k/N to yield the set of vectors

                              S = {1, Cos(πk/N), Cos(2πk/N), Cos(3πk/N), Cos(4πk/N)},     k = 1,…,N,

which forms a basis for the 5-dimensional subspace [S] of H.

In particular, if the set S contains N linearly independent vectors, then S is a basis for H, and every vector in H is a linear combination of the basis vectors,

                                                   v = a1v1 + a2v2 +…+ aNvN

for some constants ak.

Vector Norm: The notion of “vector length” can be generalized from the 2-dimensional case by defining the (squared) norm of a vector v = (v(1), v(2),…, v(N)) in H as

                                              |v|² = v(1)² + v(2)² +…+ v(N)²

This is called the Euclidean norm, and H is called Euclidean N-space.  A vector of norm one is called a unit vector. 

Inner Product:  The notion of dot product between vectors in two dimensions may be generalized to N dimensions by defining the inner product of two vectors v and w in H as

                                    <v, w> = v(1)w(1) + v(2)w(2) +…+ v(N)w(N).

The inner product is symmetric: <v, w> = <w, v>. When N=2 it may be shown that this definition is equivalent to the standard dot product given by

|v||w| Cos(β)

where β is the angle between the vectors.  Thus, the inner product in H generalizes the notion of angle between vectors.  With the inner product defined, H becomes a Hilbert space called Euclidean N-space.  Note that the inner product <v, v> is the (squared) Euclidean norm of v. 

Orthonormal Basis:  Two vectors v and w are orthogonal in H, written v ⊥ w, whenever <v, w> = 0.  Orthogonality generalizes the 2-dimensional notion of perpendicularity.  If two vectors are orthogonal they are independent, but independent vectors need not be orthogonal.  An orthonormal (ON) basis for H is a set of N mutually orthogonal unit vectors.  Thus the set of vectors Ω = {u1, ..., uN} is an ON basis for H if and only if <uk, uj> = 1 when k = j and zero otherwise.  A basic result in Hilbert space (used below) is the Schwarz inequality (see [2, §64] for a proof): for any two vectors v and w,

                                                      |<v, w>| ≤ |v||w|.

(Note the double use of "|" as absolute value on the left, and vector norm on the right.)

Fourier Series:  If v is a vector in H and Ω = {u1, ..., uN} is an ON basis for H, then the orthonormal property of Ω implies that the coefficients in the basis expansion of v are the inner products <v, un>, so that

                                      v = <v, u1> u1 + <v, u2> u2 + ... + <v, uN> uN, or

                                                 v = F1u1 + F2u2 + ... + FNuN.

This is called the Fourier series representation of v and the inner products Fn = <v, un> are the Fourier coefficients of v relative to Ω. 
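To make the computation concrete, here is a minimal numerical sketch (in Python with NumPy, which the memo itself does not use; the random basis and seed are illustrative stand-ins for the orthonormalized modeling functions introduced later):

import numpy as np

N = 8
rng = np.random.default_rng(0)

# Columns of U form an ON basis for H; here one is produced by QR factorization
# of a random matrix, standing in for the orthonormalized modeling functions.
U, _ = np.linalg.qr(rng.standard_normal((N, N)))

v = rng.standard_normal(N)

F = U.T @ v                        # Fourier coefficients Fn = <v, un>
v_rebuilt = U @ F                  # v = F1 u1 + F2 u2 + ... + FN uN
print(np.allclose(v, v_rebuilt))   # True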

Example:  The Fourier series for z is

(4)                                 z = <z, u1> u1 + <z, u2> u2 + ... + <z, uN> uN

Example:  Assume that x is well-approximated by a linear combination of ON vectors {u1, ..., uM} from H, where M < N.  Then

(5)                                             x ≈ F1u1 + F2u2 + ... + FMuM

where Fj = <x, uj> are the (unknown) Fourier coefficients of the process x.

Orthogonal Projection:  If Ω is an ON basis for H, and G is the subspace generated by a proper subset of Ω, say Ω’ = {u1, ..., uM} from H, where M < N, then

                                     v* = <v, u1> u1 + <v, u2> u2 + ... + <v, uM> uM

is called the orthogonal projection of v onto G.  Since H uses the Euclidean norm, v* is the least-squares estimate of v using vectors in G.  The subspace generated by the remaining members of Ω, namely {uM+1, ..., uN}, is called the orthogonal complement of G and written GC.  If v ∈ G and w ∈ GC, then clearly v ⊥ w.  Any vector v in H can be written as the sum of two orthogonal vectors v = u + w, where u ∈ G and w ∈ GC.

Example:  Equation (3) tacitly assumes that the noise y is orthogonal to x, so that z is the sum of the two orthogonal vectors x and y.  If, in addition, y lies in GC (as when the noise contains only the higher-frequency components), then <z, uj> = <x, uj> for j ≤ M, and approximation (5) gives

                                      x ≈ <z, u1> u1 + <z, u2> u2 + ... + <z, uM> uM

                                 y ≈ <z, uM+1> uM+1 + <z, uM+2> uM+2 + ... + <z, uN> uN

Least Squares:  Equation (4) and approximation (5) may be put in matrix form:

                                                            z ≈ Af + y,

where A is an N x M matrix with entries

                                                            akj = uj(k),           k = 1,…,N, j=1,…,M

and f is an M-dimensional vector of the unknown parameters fj. We seek the estimate f* of f that minimizes

                                                             | z – Af |²

It can be shown that the solution f* is the least-squares estimate of f given by

                                                          f* = (A'A)⁻¹A'z

where the prime ' denotes matrix transpose (see [3], Section 4.3).  Hence the least-squares estimate of x given z is

                                                              x* = Pz

where

                                                          P = A(A'A)⁻¹A'

is an N x N matrix.  This shows that the least-squares estimate of x given z is a linear transformation, depending only on the vectors {uj}, applied to the data z.  It is easily seen that P satisfies P² = P, making it a projection matrix; since P is also symmetric, it is in fact an orthogonal projection.  The expression for P contains a matrix inversion, which is computationally costly.  Using Fourier series expansions, however, allows P to be calculated easily and proved to be the projection onto the subspace G.  This is done in the next section.
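As a rough illustration of the matrix route (a sketch only, in Python/NumPy; the random A and z are placeholders, not the memo's data):

import numpy as np

N, M = 81, 5
rng = np.random.default_rng(1)
A = rng.standard_normal((N, M))     # columns play the role of the modeling vectors
z = rng.standard_normal(N)          # placeholder data vector

# f* = (A'A)^-1 A'z via the normal equations.
f_star = np.linalg.solve(A.T @ A, A.T @ z)

# Same answer from a library least-squares routine.
f_check, *_ = np.linalg.lstsq(A, z, rcond=None)
print(np.allclose(f_star, f_check))                 # True

# P = A (A'A)^-1 A' and the estimate x* = Pz.
P = A @ np.linalg.solve(A.T @ A, A.T)
x_star = P @ z
print(np.allclose(P @ P, P), np.allclose(P, P.T))   # P is idempotent and symmetric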

PROJECTION MATRIX

The following theorem calculates the projection matrix P.

Theorem 1. Let H be the Hilbert space of N-tuples of real numbers and let Ω = {u1, ..., uN} be an ON basis for H.  Let G be the subspace of H generated by the subset Ω′ = {u1, ..., uM}, where M < N.  Let P be the symmetric N x N matrix with components

(6)                                               Pk(m) = u1(k) u1(m) + u2(k) u2(m) +…+ uM(k) uM(m)

where k, m = 1, 2,…, N.  Then, for any vector v in H, Pv is the orthogonal projection of v onto G.

Proof: Write v as

                                      v = <v, u1> u1 + <v, u2> u2 + ... + <v, uN> uN.

It must be shown that Pv is equal to

                                   v* = <v, u1> u1 + <v, u2> u2 + ... + <v, uM> uM.

Choose a fixed row Pk of P (equivalently, a coordinate v*(k) of v*).  Regarding Pk as a row vector and v as a column vector, matrix multiplication of Pk with v can be written as an inner product, and gives the k-th coordinate of v*:

                                  <Pk, v> = Σn Pk(n) v(n) = Σn [ Σj uj(k) uj(n) ] v(n)

                                = Σj uj(k) [ Σn uj(n) v(n) ] = Σj uj(k) <uj, v> = v*(k),

where n runs from 1 to N and j runs from 1 to M.

Since row k was arbitrary, this is true for all rows of P and so Pv = v*.
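A small numerical check of Theorem 1 (a sketch in Python/NumPy; the orthonormal vectors are obtained here from a QR factorization of random columns, purely as a stand-in for the basis built later by Gram-Schmidt):

import numpy as np

N, M = 81, 5
rng = np.random.default_rng(2)
U, _ = np.linalg.qr(rng.standard_normal((N, M)))        # columns u1,...,uM are ON

# Equation (6): P(k,m) = u1(k)u1(m) + ... + uM(k)uM(m), a sum of outer products.
P = sum(np.outer(U[:, j], U[:, j]) for j in range(M))   # equals U @ U.T

v = rng.standard_normal(N)
v_star = U @ (U.T @ v)              # truncated Fourier sum  sum_j <v,uj> uj
print(np.allclose(P @ v, v_star))   # True, as Theorem 1 asserts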

Gram-Schmidt:  In view of Theorem 1 it is essential to have an ON basis for H that also serves as a process model.  Fortunately, for every set of N independent vectors there is an equivalent set that is orthonormal.  Let {v1, v2,…, vN} be a set of N independent vectors in H.  Then a new set {u1, u2,…, uN} that is an ON basis for H may be generated by the Gram-Schmidt procedure.  First, normalize v1 to obtain u1:

                                                           u1 = v1 / |v1|.

Clearly, u1 and v1 generate the same subspace.  Next, u2 is obtained from v2 by subtracting from v2 its component along u1, and then normalizing the result:

                                                     w2 = v2 - <v2, u1> u1

                                                          u2 = w2 / |w2|.

From <w2, u1> = 0, it follows that u1 ⊥ u2.  Furthermore, u1 and u2 generate the same subspace as v1 and v2.  Continuing by induction,

                                                  wn = vn − <vn, u1> u1 −…− <vn, un-1> un-1

and

                                                          un = wn / |wn|.

for each n ≤ N.  By the same reasoning, the sets {u1, ..., uM} and {v1, v2,…, vM} generate the same subspace for each choice of M ≤ N, and uk ⊥ uj whenever j ≠ k.
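The procedure translates directly into code; the following is a minimal sketch (Python/NumPy assumed), applied to a few sampled cosine vectors of the kind used in the example that follows:

import numpy as np

def gram_schmidt(V):
    # Orthonormalize the columns of V (assumed linearly independent):
    # wn = vn - <vn,u1>u1 - ... - <vn,un-1>un-1,  un = wn/|wn|.
    U = np.zeros_like(V, dtype=float)
    for n in range(V.shape[1]):
        w = V[:, n].astype(float)
        for k in range(n):
            w = w - (V[:, n] @ U[:, k]) * U[:, k]
        U[:, n] = w / np.linalg.norm(w)
    return U

# Example: six sampled cosine vectors, as in the example below.
N = 81
k = np.arange(N)
V = np.column_stack([np.cos((n - 1) * np.pi * k / N) for n in range(1, 7)])
U = gram_schmidt(V)
print(np.allclose(U.T @ U, np.eye(6)))   # columns are now orthonormal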

Example:  To smooth the data in Figure 1, the Gram-Schmidt procedure was applied to modeling functions

                                                 {Cos((n-1)πt), n=1,2,…,81}

These functions were first sampled to produce the set of vectors

(7)                                     {vn(k) = Cos((n-1)πk/81), n=1,2,…,81},

where k = 0, 1, …, 80, then orthonormalized to obtain the basis vectors {un}.  The first six of these are shown as curves in Figure 4.  (For better visibility, individual points have been connected to produce continuous color-coded curves.)  The constant vector is u1 = (1/9, 1/9,…, 1/9).  Roughly speaking, u1 captures the mean value of the data, while u2 models the linear trend.  The remaining vectors model varying degrees of curvature in the process.

Figure 4. Graphs of the first six orthonormal basis vectors generated by the basis vectors in Equation (7).

Data-Smoothing Example: The process vector x shown in Figure 1 was calculated as a uniform sampling of 81 points of exp(−t²) on the interval [-3.3, 3.3].  A noise vector was calculated as y = (I – P)w, where I is the identity matrix on H, and w is a vector of machine-generated random numbers from a uniform distribution with mean 0 and variance 1.  This ensures that the noise y lies in GC, i.e., is orthogonal to the model subspace G.  The resulting vector y had mean 0.01534 and variance 1.01884.  The simulated data z was obtained by adding corresponding coordinates of x and y.  To smooth the data, the first five basis vectors in Figure 4 were used to calculate P in accordance with equation (6).  The result of applying P to the data in Figure 1 is the smoothed estimate Pz, represented by the circles in Figure 2.  The error is less than 0.012 over the interval, with the greatest deviations near the end points.  Large end-point deviations are typical (see Figure 6).
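A sketch of the whole simulation (Python/NumPy assumed; a QR factorization is used in place of Gram-Schmidt, which yields the same subspaces up to sign, and the random seed and noise draw are arbitrary, so the numbers will differ slightly from those quoted above):

import numpy as np

N, M = 81, 5
t = np.linspace(-3.3, 3.3, N)
x = np.exp(-t**2)                       # the underlying process

# First M sampled cosine vectors of equation (7), orthonormalized.
k = np.arange(N)
V = np.column_stack([np.cos((n - 1) * np.pi * k / N) for n in range(1, M + 1)])
U, _ = np.linalg.qr(V)
P = U @ U.T                             # projection onto the 5-dimensional model subspace

rng = np.random.default_rng(3)
w = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), N)   # uniform, mean 0, variance 1
y = (np.eye(N) - P) @ w                 # noise orthogonal to the model subspace
z = x + y                               # simulated data

x_star = P @ z                          # smoothed estimate
print(np.max(np.abs(P @ y)))            # ~0: no residual noise, since Py = 0
print(np.max(np.abs(x_star - x)))       # total error is therefore pure bias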

Row Norms:  For each row Pk of P, the computation of the estimate x*(k) as <Pk, z> is a weighted average of the coordinates of z.  Figure 5 is a plot of five of these rows for the P matrix used to obtain the smoothed estimates in Figure 2.  An animation of all 81 rows would show a traveling wave, beginning high on the left (near 0.4), dropping quickly as it moves to the right with a peak height around 0.2, and finally building up again at the right end.  The symmetry of these curves is due to the fact that P is symmetric by definition and that P² = P.

Figure 5. Rows 1, 21, 41, 61, 81 of the P matrix used to obtain the estimates in Figure 2.

The rows Pk of P, regarded as vectors in H, enjoy another property, namely:

Theorem 2:  The square roots of the diagonal entries of P are the norms of their respective rows, i.e., |Pk|² = Pk(k).

Proof:

|Pk|² = <Pk, Pk> = Σj Pk(j) Pk(j) = Σj Pk(j) [ Σn un(k) un(j) ]

= Σn un(k) [ Σj Pk(j) un(j) ] = Σn un(k) <Pk, un> = Σn un(k) un(k) = Pk(k) ▌

Here j runs from 1 to N, n runs from 1 to M, and the next-to-last equality uses <Pk, un> = (Pun)(k) = un(k), which holds because un lies in G and P is the projection onto G (Theorem 1).
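A quick numerical confirmation of Theorem 2 (sketch only; any M orthonormal columns will do, here random ones obtained via QR):

import numpy as np

N, M = 81, 5
rng = np.random.default_rng(4)
U, _ = np.linalg.qr(rng.standard_normal((N, M)))
P = U @ U.T

row_norms_sq = np.sum(P**2, axis=1)           # |Pk|^2 for each row k
print(np.allclose(row_norms_sq, np.diag(P)))  # True: |Pk|^2 = Pk(k)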

Bias-Variance Tradeoff:  The squared norms |Pk|² are important because they represent an upper bound on the smoothing capability of their respective rows.  Define the estimation error due to P by

x* − x = Pz − x = Px + Py − x = b + Py,

where the bias vector is defined by

b = Px – x,

and Py is the residuals vector.  The standard bias-variance decomposition of the squared-error is

                                   |x* − x|² = |b + Py|² = |b|² + |Py|² + 2<b, Py>.

The quantity |Py|² is called the residual variance due to P; it is the amount of noise remaining in the estimate.  The quantity |b|² is called the bias due to P and indicates any lack of fit by the modeling functions.  The last term measures the correlation between the bias and the residuals, and is usually assumed to be zero.  Notice that if y is truly orthogonal to G then the residual variance is zero, and if x is truly a member of G then P is unbiased and b = 0.

There is a bias-variance tradeoff inherent in this and similar procedures: as model complexity increases, the bias decreases while the variance increases (see reference [1] for more on this).  In the current procedure, model complexity is indicated by the dimension of the subspace G.  More modeling functions in G means less bias due to P at the cost of increased residual variance. 
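The decomposition is easy to verify numerically; the following sketch (Python/NumPy, with arbitrary random x and y, not the simulation above) checks it for a generic projection:

import numpy as np

N, M = 81, 5
rng = np.random.default_rng(5)
U, _ = np.linalg.qr(rng.standard_normal((N, M)))
P = U @ U.T

x = rng.standard_normal(N)      # a process not exactly in the model subspace
y = rng.standard_normal(N)      # noise not exactly orthogonal to it
z = x + y

b = P @ x - x                   # bias vector
r = P @ y                       # residuals vector
lhs = np.linalg.norm(P @ z - x) ** 2
rhs = b @ b + r @ r + 2 * (b @ r)
print(np.allclose(lhs, rhs))    # True: squared error = bias + residual variance + cross term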

Noise-Ratio:  Define a vector RG, called the noise-ratio due to G, coordinate-wise by

 RG(k) = (x*(k) − x(k))² / |y|².

RG(k) represents the fraction of noise left in the estimate by the kth row of P.  Its dependence on the number, the length and nature of the basis vectors is indicated by the subscript G.  Using the bias-variance decomposition coordinate-wise we have

(x*(k) − x(k))² = (b(k) + <Pk, y>)² = b(k)² + <Pk, y>² ≤ b(k)² + |Pk|² |y|²

where the Schwarz inequality was used to obtain the inequality, and the bias and residuals are assumed uncorrelated (which removes the cross term in the second equality).  Thus, in the unbiased case, the squared norms |Pk|² are upper bounds for the noise-ratios:

RG(k) ≤ |Pk|²

for each k.

Figure 6. Upper bounds for noise-ratios vs. row number for rows of P generated by the basis vectors in Equation (7).

Figure 6 shows plots of |Pk|² versus k for the basis vectors in equation (7), for subspace dimensions 1 through 6.  The first curve, labeled Dim 1, corresponds to a subspace G of dimension 1 generated by u1.  The second curve, labeled Dim 2, corresponds to a G of dimension 2 generated by u1 and u2, with similar correspondences for the other curves.  Notice that, for fixed k, |Pk|² increases with the dimension of G, and that, except in the Dim 1 case, it is largest near the end points.  These curves look the same for different values of N; only the scale changes, decreasing uniformly as N increases.  At the midpoint, the value for Dim 1 equals that for Dim 2; similarly the value for Dim 3 equals that for Dim 4, and so on for higher dimensions.  Figure 7 shows upper bounds for midpoint noise-ratios versus the number of data points, for the basis vectors in equation (7) and for subspaces G of odd dimension 1 through 9.
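The bounds plotted in Figure 6 can be reproduced along these lines (a sketch; the orthonormalization is again done by QR as a stand-in for Gram-Schmidt):

import numpy as np

N = 81
k = np.arange(N)
V = np.column_stack([np.cos((n - 1) * np.pi * k / N) for n in range(1, 7)])
U, _ = np.linalg.qr(V)              # orthonormalized cosine vectors of equation (7)

for dim in range(1, 7):
    bounds = np.diag(U[:, :dim] @ U[:, :dim].T)   # |Pk|^2 = Pk(k) for each row k
    print(dim, round(bounds.max(), 4), round(bounds[N // 2], 4))   # edge vs. midpoint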

Figure 7. Upper bounds for midpoint noise-ratios vs. number of data points, for subspaces of dimension 1, 3, 5 ,7 and 9 generated by the basis vectors in Equation (7).

Fourier Coefficients:  Figures 1 and 2 provide visual evidence for the ability of the projection matrix to accurately extract the underlying process from simulated noisy data.  Another way to see this effect is to look at the Fourier coefficients of x, y and z.  Figure 8 shows the Fourier coefficients Fn for the data in Figure 1, the estimate in Figure 2 and the noise in Figure 3, relative to the basis vectors in equation (7).  Numbers on the x-axis are values of n called the coefficient number.  The effectiveness of this procedure can be seen in the clean separation of the data coefficients into the process and the noise coefficients.  The process retains the first five coefficients, corresponding to the five ON basis vectors used in the process model.

Figure 8a. Fourier coefficients of the data in Figure 1

Figure 8b. Fourier coefficients of the process estimate in Figure 2

Figure 8c. Fourier coefficients of the noise estimate in Figure 3

Subspace Dimension:  We have so far avoided the question of which dimension to choose.  In the simulated case above the truth is known, as seen in Figure 1, so we can judge that any dimension less than 5 produces a biased estimate.  The situation is different in real-world applications, where the truth is never known.  Selecting the subspace dimension is analogous to choosing the kernel width in kernel smoothing ([1], Chapter 6), the multiplier of the penalty term in penalized-fit methods ([1], Chapter 7), and the frequency cutoff in low-pass digital filtering.  A low-pass filter is a linear device that removes frequencies above the cutoff frequency, which is exactly what the matrix P does, as seen in Figure 8.  Each coefficient Fn corresponds to a single cyclic data component of fixed period L.  If n > 1, the period of the corresponding component is given by

L = 2N/(n-1)

where N is the number of points.  For example, with N = 81 the coefficient at n = 5 corresponds to a component with a period of 40.5 data points.  In this memo the choice of subspace dimension is guided by the Fourier coefficients, as seen in the following example.
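For reference, the periods attached to a few coefficients when N = 81 (a one-line computation from the formula above):

N = 81
for n in (2, 3, 5, 8):
    print(n, 2 * N / (n - 1))   # periods 162.0, 81.0, 40.5 and about 23.1 data points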

AN EXAMPLE FROM THE STOCK MARKET

Closing prices for stocks and other market items are reported at the end of each trading day, along with the number of shares traded, known as the item’s volume.  The product of these two numbers, called price-volume, is an indicator of the item’s daily dollar volume.  Figure 9 shows 81 points of price-volume data (×10⁻⁸) for Microsoft (MSFT) ending 10/06/06; data points are indicated by circles and connected by straight lines.

Figure 9.  Price-volume (×10⁻⁸) for Microsoft ending 10/06/06.

Some market analysts search for cyclic patterns in this type of data to help determine market direction.  Figure 10 shows Fourier coefficients for the data, which has been adjusted to have mean zero. Removing the mean is equivalent to setting the first coefficient to zero, which emphasizes the other values. 

Figure 10.  Fourier coefficients for the data in Figure 9 after mean deletion.

There are dominant values at n = 3, corresponding to a slowly varying component, and at n = 8, the largest value.  In addition there are some “natural gaps”, where coefficients have very small values and so contribute little; in Figure 10 these occur at n = 4, 9, 12, and 25, among other places.  Accordingly, the data in Figure 9 was smoothed using the basis vectors in (7) and dimensions 4, 9, 12 and 25.  The results are shown in Figure 11, where the relation between dimension and bias in the smoothed curves is apparent.

Figure 11.  Data and smoothed estimates using basis vectors in equation (7), and dimensions 4, 9, 12 and 25.

The dominant component at n=8 can be isolated by subtraction and is shown in Figure 12.  Its period is seen to be about 23 days, which corresponds to the time between the major peaks in Figure 9.  The analyst can use any or all of the smoothings or isolated components for predicting market direction.
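A sketch of the corresponding computation (Python/NumPy; the file name and loading step are hypothetical, since the MSFT series is not reproduced here):

import numpy as np

# Hypothetical input: 81 daily price*volume values for the item being studied.
price_volume = np.loadtxt("msft_price_volume.txt")   # placeholder file name
N = len(price_volume)

k = np.arange(N)
V = np.column_stack([np.cos((n - 1) * np.pi * k / N) for n in range(1, N + 1)])
U, _ = np.linalg.qr(V)                    # orthonormalized basis of equation (7)

z = price_volume - price_volume.mean()    # mean removal zeroes the first coefficient
F = U.T @ z                               # Fourier coefficients, as in Figure 10

# Smooth at the "natural gap" dimensions and isolate the dominant n = 8 component.
smoothed = {dim: U[:, :dim] @ F[:dim] for dim in (4, 9, 12, 25)}   # = Pz for each dimension
component_8 = F[7] * U[:, 7]              # the cyclic component belonging to n = 8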

Figure 12.  Data component corresponding to coefficient n=8.

REFERENCES

1. Hastie, T., Tibshirani, R., and Friedman, J., The Elements of Statistical Learning, Springer, 2001.
2. Halmos, P., Finite-Dimensional Vector Spaces, Van Nostrand, 1993.
3. Luenberger, D.G., Optimization by Vector Space Methods, Wiley, 1969.