Correlation and Prediction in the Stock Market
by Rick Martinelli, March 2012
Copyright Haiku Laboratories
INTRODUCTION
One of the first models of stock market data was introduced by Bachelier in 1900 [1]. His model assumes daily stock price changes form a white-noise time series, i.e., a stationary, uncorrelated series with zero mean and constant variance. As such, it was not a predictive model. More sophisticated models, such as autoregressive (AR) models, use the correlations in weakly stationary series to provide predictive models [2], but stock data longer than a few days is not weakly stationary either [3,4]. Some of the nonstationary features in these series may be modeled by generalized AR models like ARCH [5] and GARCH [6], which incorporate a time-dependent variance and/or autocovariance. In this paper, the basic assumption is that one of the nonstationary properties exhibited by many stocks is a high autocorrelation in price changes, implying a longer-than-average trending period, during which linear prediction is more accurate.
To test this assumption, we define two auxiliary time series associated with a given stock. The first is the series of autocorrelations in the price changes of each three-day time interval. This series indicates whether the price series is trending and the intensity of the trend, but not its direction. The second series contains predicted price changes based on each three-day sub-segment. This latter series indicates the direction of any trend. The two series may be plotted as a two-dimensional “phase diagram” that characterizes the price series. Our assumption is now that points in the phase diagram indicating a trend will be the most accurate in predicting the direction of price movement. To test this assumption we employ a simulated trading system in which long and short market positions are generated from the phase diagram. The accuracy of the method is measured by the ratio of successful positions to total positions, the so-called hit-ratio. In addition to the hit-ratio, the amount of profit or loss realized if each indicated trade is entered and closed out at the next day’s price is also considered.
CORRELATION AND TRENDS
Suppose X = X_{k} is a time series of length N representing a stock’s daily price and define its increment series by Z = Z_{k} = X_{k} − X_{k−1}. The lag-1 autocovariance coefficient of Z is defined by
C(Z) = <Z_{k}Z_{k−1}> − <Z_{k}>^{2} ,
where the notation <…> means ‘average over all possible values of k’. The lag-1 autocorrelation coefficient of Z (AC1) is a normalized version of C(Z) defined by
g(Z) = C(Z)/S(Z)^{2},
where S(Z)^{2} = <Z_{k}^{2}> − <Z_{k}>^{2} is the variance of Z. As such, the autocorrelation inherits most of the properties of the autocovariance.
The individual products ω_{k} = Z_{k} Z_{k−1} are the smallest components of the AC1. A positive ω_{k} indicates a three-day trend in X, a microtrend, ending at day k, and ω_{k} will serve as the trend parameter below. The ω_{k}’s indicate microtrends in the price series in the following way. If a stock is in a three-day upward trend ending at day k, then X_{k} > X_{k−1} > X_{k−2} so, in terms of increments, Z_{k}Z_{k−1} > 0. For a three-day downward trend, X_{k} < X_{k−1} < X_{k−2} and again Z_{k}Z_{k−1} > 0. Thus, ω_{k} > 0 indicates a three-day trend. If, conversely, ω_{k} < 0, the increments Z_{k} and Z_{k−1} are in opposite directions, indicating either a short cycle or noisy behavior.
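As a rough illustration of these definitions, the following Python sketch computes the increments, the AC1 and the products ω_{k} with NumPy. The function names are hypothetical, and the averages are taken over whatever terms are available (N−1 increments, N−2 products), which is only one possible reading of ‘all possible values of k’.

```python
import numpy as np

def increments(x):
    """Z_k = X_k - X_{k-1} for a 1-D array of daily prices."""
    return np.diff(np.asarray(x, dtype=float))

def ac1(z):
    """Lag-1 autocorrelation g(Z) = C(Z) / S(Z)^2 (assumed averaging convention)."""
    z = np.asarray(z, dtype=float)
    mean = z.mean()
    c = np.mean(z[1:] * z[:-1]) - mean**2   # C(Z) = <Z_k Z_{k-1}> - <Z_k>^2
    s2 = np.mean(z**2) - mean**2            # S(Z)^2 = <Z_k^2> - <Z_k>^2
    return c / s2

def omega(z):
    """Trend parameter omega_k = Z_k * Z_{k-1}, one value per adjacent pair."""
    z = np.asarray(z, dtype=float)
    return z[1:] * z[:-1]
```

A positive entry of `omega(increments(prices))` marks a three-day stretch in which both daily changes had the same sign. Note that `ac1` is undefined (division by zero) for a series with constant increments.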
LINEAR PREDICTION
When ω_{k} > 0 its value provides some indication of the intensity of a trend at time k but not the direction. For direction we use a simple, linear, least-squares (LLSQ) estimate X*_{k+1} of X_{k+1}, calculated as a linear combination of the previous values X_{k}, X_{k−1}, X_{k−2}. This predictor essentially extends the least-squares line through the three points to the predicted point. The LLSQ model is the line x(t) = At + B fitted to the three points placed at local times t = 1, 2, 3 (for X_{k−2}, X_{k−1}, X_{k} respectively), where A is the slope of the prediction line and B the intercept. The associated prediction is the value of the line at t = 4, that is, X*_{k+1} = 4A + B. Using standard least-squares techniques the coefficients are found to be
A = (X_{k} − X_{k−2})/2,
B = (4X_{k−2} + X_{k−1} − 2X_{k})/3.
Substituting these values into the prediction equation yields the LLSQ predictor on 3 points,
X*_{k+1} = (4X_{k} + X_{k−1} − 2X_{k−2})/3 ,
and predicted change
δ_{k} = X*_{k+1} − X_{k} = (X_{k} + X_{k−1} − 2X_{k−2})/3 .
In terms of the increments Z_{k}, the predicted change δ_{k} is a weighted average of adjacent values given by δ_{k} = (Z_{k} + 2Z_{k−1})/3. The value of δ_{k} indicates both the direction and amount of the predicted change, a positive value indicating a price increase, and a negative value a decrease. As such it serves as the direction parameter in the system below.
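The closed-form predicted change can be checked numerically against an explicit least-squares fit. The sketch below is illustrative only: the function names are hypothetical, and it assumes the three points of each window sit at local times 1, 2, 3 with the prediction read off at time 4.

```python
import numpy as np

def predicted_change_closed_form(x):
    """delta_k = (X_k + X_{k-1} - 2 X_{k-2}) / 3 for every k with two predecessors."""
    x = np.asarray(x, dtype=float)
    return (x[2:] + x[1:-1] - 2.0 * x[:-2]) / 3.0

def predicted_change_by_fit(x):
    """Same quantity from a least-squares line through each 3-point window."""
    x = np.asarray(x, dtype=float)
    t = np.array([1.0, 2.0, 3.0])          # assumed local times of X_{k-2}, X_{k-1}, X_k
    out = []
    for k in range(2, len(x)):
        slope, intercept = np.polyfit(t, x[k - 2:k + 1], 1)
        out.append(4.0 * slope + intercept - x[k])   # X*_{k+1} - X_k
    return np.array(out)
```

For any price array the two functions should agree to floating-point accuracy, and both should equal `(z[1:] + 2 * z[:-1]) / 3` where `z = np.diff(x)`.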
THE PHASE DIAGRAM
The correlation components ω_{k} and predicted changes δ_{k} may be displayed as ordered pairs (δ_{k}, ω_{k}) in a ‘phase diagram’ which characterizes the increment series Z, and hence the price series X (Figure 1). Noting that δ_{k} is a weighted average of Z_{k} and Z_{k−1}, and ω_{k} is the square of their geometric mean, by the inequalities between arithmetic and geometric means of real numbers we have ω_{k} ≤ 9δ_{k}^{2}/8 at each time k. Consequently, all points in the diagram are bounded above by a parabolic envelope with vertex at the origin and equation ω_{k} = 9δ_{k}^{2}/8.
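The parabolic bound can also be obtained directly from the non-negativity of a square. A short derivation, using only the definitions δ_{k} = (Z_{k} + 2Z_{k−1})/3 and ω_{k} = Z_{k}Z_{k−1}:

```latex
\begin{align*}
0 &\le (Z_k - 2Z_{k-1})^2 = Z_k^2 - 4 Z_k Z_{k-1} + 4 Z_{k-1}^2 \\
\Rightarrow\quad 8 Z_k Z_{k-1} &\le Z_k^2 + 4 Z_k Z_{k-1} + 4 Z_{k-1}^2
   = (Z_k + 2Z_{k-1})^2 = 9\,\delta_k^2 \\
\Rightarrow\quad \omega_k = Z_k Z_{k-1} &\le \tfrac{9}{8}\,\delta_k^2 .
\end{align*}
```

Equality holds exactly when Z_{k} = 2Z_{k−1}, so points on the envelope correspond to days on which the latest increment is twice the previous one.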
Figure 1. The phase diagram for FVE (NYSE, AC1 = 0.46, 252 trading days, ending 03/07/12). Points in the upper half-plane indicate trends, while points in the lower half-plane indicate noise or cycles. Points in the right half-plane indicate predicted price increases, and those in the left half-plane indicate predicted price decreases.
In the diagram, points in the upper half-plane (ω_{k} > 0) represent locations of microtrends in X; those in the lower half-plane indicate noise and/or cycles. Points in the first quadrant are those having δ_{k} > 0 indicating an uptrend, while those in the second quadrant have δ_{k} < 0 indicating a downtrend. (Quadrants are numbered 1 through 4, counterclockwise starting in the upper right.) Our assumption is now that points in the first two quadrants will indicate the most accurate predictions of price change. In the analysis below, each of the four quadrants defines a trading system.
TRADING SYSTEMS AND HIT RATIO
Given a price series X, a trading system S for X is a time series λ_{k} taking values 0, +1 and −1 only, where +1 represents a long position, −1 a short position, and 0 indicates no position, on day k. (For prediction purposes, the series λ must be causally related to the series X, that is, the calculation of λ_{k} can only involve values of X_{m} if m < k.) The trading systems used here are based on the points P_{k} = (δ_{k}, ω_{k}) in the phase diagram:
System S_{1}: for each k, λ_{k} = 1 if P_{k−1} is in the first quadrant, and λ_{k} = 0 otherwise
System S_{2}: for each k, λ_{k} = −1 if P_{k−1} is in the second quadrant, and λ_{k} = 0 otherwise
System S_{3}: for each k, λ_{k} = −1 if P_{k−1} is in the third quadrant, and λ_{k} = 0 otherwise
System S_{4}: for each k, λ_{k} = 1 if P_{k−1} is in the fourth quadrant, and λ_{k} = 0 otherwise
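The four systems above can be sketched in Python as follows. This is a minimal illustration, not the code used for the study: the function names and 0-based array layout are assumptions, and days for which P_{k−1} does not exist are simply given position 0.

```python
import numpy as np

def phase_points(x):
    """Return (delta, omega); array index j corresponds to day k = j + 2 (0-based days)."""
    z = np.diff(np.asarray(x, dtype=float))   # z[i] = X_{i+1} - X_i
    omega = z[1:] * z[:-1]                    # omega_k = Z_k Z_{k-1}
    delta = (z[1:] + 2.0 * z[:-1]) / 3.0      # delta_k = (Z_k + 2 Z_{k-1}) / 3
    return delta, omega

def positions(x, system):
    """Positions lambda_k for system 1..4, each computed from P_{k-1} only."""
    delta, omega = phase_points(x)
    in_quadrant = {
        1: (delta > 0) & (omega > 0),
        2: (delta < 0) & (omega > 0),
        3: (delta < 0) & (omega < 0),
        4: (delta > 0) & (omega < 0),
    }[system]
    sign = 1 if system in (1, 4) else -1      # long for S1 and S4, short for S2 and S3
    lam = np.zeros(len(x), dtype=int)
    # P_k exists for k >= 2, so lambda_k (based on P_{k-1}) exists for k >= 3.
    lam[3:] = sign * in_quadrant[:-1].astype(int)
    return lam
```

For example, `positions([10, 11, 12, 13, 12, 11], 1)` marks long positions on the two days that follow first-quadrant points and is zero elsewhere.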
To evaluate a trading system we use the return series r_{k} = X_{k}/X_{k−1} − 1. This series represents the fractional price change between day k−1 and day k. A trade is deemed successful if the product
π_{k+1} = r_{k+1} λ_{k}
is positive, indicating that r_{k+1} and λ_{k} have the same sign, i.e., that the direction of the price change was predicted correctly. In most systems, many of the λ_{k}’s will be zero. The hit-ratio for system S_{n} and stock X is defined as the ratio of the number of positive π_{k}’s to the number of nonzero π_{k}’s at the end of the simulation, and is denoted by H_{n} = H(S_{n}, X). It provides a measure of how well the system S_{n} predicts the direction of price changes for stock X (hit-ratios are reported in percent from zero to 100). Assuming random predictions would result in 50% accuracy, relative hit-ratios are calculated as H_{n} − 50%.
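The hit-ratio definition translates almost directly into code. The sketch below assumes a price array `x` and a position array `lam` of the same length (both names are hypothetical), pairs each position with the following day's return exactly as in the product π_{k+1} above, and returns NaN when a system makes no trades, a case the text does not address.

```python
import numpy as np

def trade_products(x, lam):
    """pi_{k+1} = r_{k+1} * lambda_k, with r_k = X_k / X_{k-1} - 1."""
    x = np.asarray(x, dtype=float)
    lam = np.asarray(lam, dtype=float)
    r = x[1:] / x[:-1] - 1.0          # r[i] holds the return r_{i+1}
    return r * lam[:-1]               # the last position has no next-day return

def hit_ratio(x, lam):
    """Percentage of nonzero products that are positive (NaN if there are none)."""
    pi = trade_products(x, lam)
    nonzero = np.count_nonzero(pi)
    if nonzero == 0:
        return float("nan")
    return 100.0 * np.count_nonzero(pi > 0) / nonzero
```

Under this definition a position held over a day with exactly zero return produces a zero product and is therefore left out of both the numerator and the denominator.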
SYSTEM-RETURN AND EFFICIENCY
In addition to the hit-ratio, we also consider the amount of profit or loss realized if each trade indicated by λ_{k} is actually entered on day k at price X_{k} and closed the following day at price X_{k+1}. By considering only transactions of exactly one trading unit, r_{k} becomes the actual profit/loss available on day k, and π_{k} then represents the profit/loss achieved by the system on day k. Summing the π_{k} over all days gives the accumulated profit/loss from the simulation, or the system-return
R_{n} = R(S_{n}, X) = ∑π_{k}.
In a perfect trading system, where all predicted directions are correct, the maximum profit available from the stock X is the sum of absolute values of the r_{k}, known as the available profit (AP) in X and denoted
A(X) = ∑ |r_{k}| .
Like the AC1, the AP is a property of the stock, not the trading system. Normalizing the system-return to remove the effect of the AP gives the efficiency of the system S_{n} for stock X:
E_{n} = E(S_{n}, X) = R(S_{n}, X)/A(X).
The efficiency provides a measure of the relative performance of a given system on stocks with different AC1’s (efficiencies are reported as percents). The quantity A(X)/N is sometimes referred to as the volatility of the returns, which is essentially what the trading system is “trading”.
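The system-return, available profit and efficiency can be sketched in the same style. The function names and the array inputs `x` (prices) and `lam` (positions) are illustrative assumptions; positions are paired with next-day returns as in π_{k+1} = r_{k+1} λ_{k}.

```python
import numpy as np

def daily_returns(x):
    """r_k = X_k / X_{k-1} - 1 for each day after the first."""
    x = np.asarray(x, dtype=float)
    return x[1:] / x[:-1] - 1.0

def system_return(x, lam):
    """R = sum of pi_k, the accumulated profit/loss of the simulated system."""
    lam = np.asarray(lam, dtype=float)
    return float(np.sum(daily_returns(x) * lam[:-1]))

def available_profit(x):
    """A(X) = sum of |r_k|, the return of a system that is always right."""
    return float(np.sum(np.abs(daily_returns(x))))

def efficiency(x, lam):
    """E = R / A(X), reported in percent."""
    return 100.0 * system_return(x, lam) / available_profit(x)
```

A system whose position always matches the sign of the next return reaches an efficiency of 100%, and one that is always wrong reaches −100%, which gives a quick sanity check on any implementation.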
SIMULATION RESULTS
A database of 4662 stocks from the NYSE (1810), AMEX (400) and NASDAQ (2452), ending 07 March 2012 (1 year, 252 data points), was compiled for this study. (All data was obtained from Yahoo Finance.) The database was prescreened to eliminate stocks having zero volume on the last day. Daily data values consist of the four basic recorded prices: the open, the high and low of the day, and the close. Previous work [7] had shown that the series of averages of the open and close prices (OC-average) often have much larger AC1 values than either the open or close series alone. For this reason, the OC-average series for each stock was used in the simulations.
Stocks in each of the three markets were submitted to the four systems described above. That is, for each stock X and system S_{n}, the relative hit-ratio H_{n} − 50% and the efficiency E_{n} were calculated. Figure 2 shows the average relative hit-ratios and efficiencies for each of the four systems S_{n} as applied to all stocks in the three markets. In all cases, systems S_{1} and S_{2} achieved much higher values than systems S_{3} and S_{4}, supporting our basic assumption.
Also calculated for each stock are the AC1, the trend-point hit-ratios H_{12} = (H_{1} + H_{2})/2 and the efficiencies E_{12} = (E_{1} + E_{2})/2 from system S_{12}. Figure 3 shows H_{12} and E_{12} plotted against AC1 for each of the three markets. A linear relationship is assumed in each case and the resulting least-squares regression line is shown, along with its equation y = px + q and coefficient of determination R^{2}. In the equation for hit-ratios, for example, y is H_{12} − 50%, x is AC1, p is the slope and q the intercept of the line. Here the AC1 is considered a pure number and the slope and intercept have units of percent. Equation coefficients for relative hit-ratios and efficiencies are summarized in Table 1.

          Hit-Ratios              Efficiencies
          Slope     Intercept     Slope     Intercept
AMEX      35.5%     2.6%          26.8%     0.7%
NYSE      31.5%     0.5%          28.7%     0.7%
NASDAQ    37.0%     3.3%          27.5%     0.1%
Table 1. Least-squares coefficients for the relative hit-ratio and efficiency equations in Figure 3.
For example, an increase of one unit of AC1 results, on average, in an increase of 37.0 percentage points in the hit-ratio, and 27.5 percentage points in efficiency, for NASDAQ stocks under system S_{12}. Positive slope values in all cases support the assumption that larger AC1’s result in better predictions.
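A regression line of the form y = px + q, together with its R^{2}, can be obtained with a few lines of NumPy. The sketch below is a generic least-squares fit under the assumption that `x` holds per-stock AC1 values and `y` the matching relative hit-ratios or efficiencies; the function name is hypothetical, and R^{2} is computed here as the usual coefficient of determination.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares line y = p*x + q and its coefficient of determination R^2."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    p, q = np.polyfit(x, y, 1)                 # slope first, then intercept
    residuals = y - (p * x + q)
    ss_res = np.sum(residuals**2)              # unexplained variation
    ss_tot = np.sum((y - y.mean())**2)         # total variation about the mean
    return float(p), float(q), float(1.0 - ss_res / ss_tot)
```

Calling `fit_line(ac1_values, relative_hit_ratios)` for the stocks of one market would return a slope and intercept in the units of the second argument, as in Table 1.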
REFERENCES
[1] Louis Bachelier, Théorie de la Spéculation, Annales de l'École Normale Supérieure, 1900.
[2] R. S. Tsay, Analysis of Financial Time Series, Wiley-Interscience, New Jersey, 2005.
[3] R. L. Brown, J. Durbin, J. M. Evans, Techniques for Testing the Constancy of Regression Relationships over Time, Journal of the Royal Statistical Society, Series B, Vol. 37 (1975), 149-192.
[4] B. Mandelbrot, R. Hudson, The (mis)Behavior of Markets, Basic Books, New York, 2004.
[5] R. F. Engle, Autoregressive Conditional Heteroscedasticity with Estimates of the Variance of United Kingdom Inflation, Econometrica, 1982, vol. 50, issue 4, pages 987-1007.
[6] Tim Bollerslev, Generalized Autoregressive Conditional Heteroskedasticity, Journal of Econometrics, 1986, vol. 31, issue 3, pages 307-327.
[7] R. Martinelli, Induced Correlation in Stock Market Data, Dec. 2011.
Figure 2. Relative hit-ratios and efficiencies for each of the four trading systems S_{1}, S_{2}, S_{3}, S_{4}, as applied to all stocks in the three markets. Panels, one pair per market (AMEX, NYSE, NASDAQ): relative hit-ratios vs. trading system, and efficiencies vs. trading system.
Figure 3. Relative hit-ratios for trend points (H_{12} − 50%, left plots, green) and efficiencies E_{12} (right plots) are plotted versus AC1 for stocks in the three markets (AMEX, NYSE, NASDAQ). Least-squares lines are shown, and their equations and R^{2} values are given at the top of each plot.