Correlation and Prediction in the Stock Market

by Rick Martinelli, March 2012

Copyright Haiku Laboratories



One of the first models of stock market data was introduced by Bachelier in 1900 [1].  His model assumes daily stock price changes form a white noise time series, i.e., a stationary, uncorrelated series with zero mean and constant variance.  As such, it is not a predictive model.  More sophisticated models, such as auto-regressive (AR) models, utilize the correlations in weakly stationary series to provide predictions [2], but stock series longer than a few days are not weakly stationary either [3,4].  Some of the non-stationary features in these series may be modeled by generalized AR models like ARCH [5] and GARCH [6], which incorporate a time-dependent variance and/or auto-covariance.  In this paper, the basic assumption is that one of the non-stationary properties exhibited by many stocks is a high auto-correlation in price-changes, implying a longer than average trending period, during which linear prediction is more accurate.


To test this assumption, we define two auxiliary time-series associated with a given stock.  The first is the series of auto-correlations in the price-changes of each three-day time interval.  This series indicates whether the price series is trending and the intensity of the trend, but not its direction.  The second series contains predicted price-changes based on each three-day sub-segment.  The latter series indicates the direction of any trend.  The two series may be plotted as a two-dimensional “phase diagram” that characterizes the price series.  Our assumption is now that points in the phase diagram indicating a trend will be the most accurate in predicting the direction of price movement.  To test this assumption we employ a simulated trading system in which long and short market positions are generated from the phase diagram.  The accuracy of the method is measured by the ratio of successful positions to total positions, the so-called hit-ratio.  In addition to the hit-ratio, we also consider the amount of profit or loss realized if each indicated position is opened and then closed at the next day’s price.



Suppose $X = \{X_k\}$ is a time-series of length $N$ representing a stock’s daily price, and define its increment series by $Z = \{Z_k\}$, where $Z_k = X_k - X_{k-1}$.  The lag-1 auto-covariance coefficient of $Z$ is defined by


$C(Z) = \langle Z_k Z_{k-1} \rangle - \langle Z_k \rangle^2$,


where the notation $\langle \cdot \rangle$ means ‘average over all possible values of $k$’.  The lag-1 auto-correlation coefficient of $Z$ (AC1) is a normalized version of $C(Z)$ defined by


$g(Z) = C(Z)/S(Z)^2$,


where $S(Z)^2 = \langle Z_k^2 \rangle - \langle Z_k \rangle^2$ is the variance of $Z$.  As such, the auto-correlation inherits most of the properties of the auto-covariance.
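The AC1 definition above can be sketched in a few lines of Python.  This is a minimal illustration, not the code used in the study: the function names are hypothetical, and the handling of the series ends (the lag product is averaged over the N−2 adjacent pairs of increments, the mean and variance over all N−1 increments) is an assumption, since the text only says 'average over all possible values of k'.

```python
def increments(prices):
    """Return the increment series Z_k = X_k - X_{k-1}."""
    return [b - a for a, b in zip(prices, prices[1:])]


def mean(values):
    return sum(values) / len(values)


def ac1(prices):
    """Lag-1 auto-correlation g(Z) = C(Z) / S(Z)^2 of the price increments."""
    z = increments(prices)
    if len(z) < 2:
        raise ValueError("need at least three prices")
    m = mean(z)
    # <Z_k Z_{k-1}>: average over the adjacent pairs of increments
    # (edge handling is an assumption of this sketch).
    lag_product = mean([a * b for a, b in zip(z, z[1:])])
    covariance = lag_product - m * m              # C(Z)
    variance = mean([v * v for v in z]) - m * m   # S(Z)^2
    if variance == 0:
        raise ValueError("constant increments: AC1 is undefined")
    return covariance / variance
```

A price series that alternates up and down by the same amount, such as 0, 1, 0, 1, 0, has increments of opposite sign on consecutive days and gives an AC1 of −1 under this sketch.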


The individual products $\omega_k = Z_k Z_{k-1}$ are the smallest components of the AC1.  A positive $\omega_k$ indicates a three-day trend in $X$, a micro-trend, ending at day $k$, and will serve as the trend parameter below.  The $\omega_k$ indicate micro-trends in the price series in the following way.  If a stock is in a three-day upward trend ending at day $k$, then $X_k > X_{k-1} > X_{k-2}$ so, in terms of increments, $Z_k Z_{k-1} > 0$.  For a three-day downward trend, $X_k < X_{k-1} < X_{k-2}$ and again $Z_k Z_{k-1} > 0$.  Thus, $\omega_k > 0$ indicates a three-day trend.  If, conversely, $\omega_k < 0$, the increments $Z_k$ and $Z_{k-1}$ are in opposite directions, indicating either a short cycle or noisy behavior.



When $\omega_k > 0$ its value provides some indication of the intensity of a trend at time $k$ but not the direction.  For direction we use a simple, linear, least-squares (LLSQ) estimate $X^*_{k+1}$ of $X_{k+1}$, calculated as a linear combination of the previous values $X_k$, $X_{k-1}$, $X_{k-2}$.  This predictor essentially extends the least-squares line through the three points to the predicted point.  Indexing the three days $k-2$, $k-1$, $k$ by the local times $t = 1, 2, 3$, the LLSQ model is $X = At + B$, where $A$ is the slope of the prediction line and $B$ the intercept.  The associated prediction equation, evaluated at $t = 4$, is $X^*_{k+1} = 4A + B$.  Using standard least-squares techniques the coefficients are found to be


$A = (X_k - X_{k-2})/2$,


$B = (4X_{k-2} + X_{k-1} - 2X_k)/3$.


Substituting these values into the prediction equation yields the LLSQ-predictor on 3 points,


$X^*_{k+1} = (4X_k + X_{k-1} - 2X_{k-2})/3$,


and predicted change


$\delta_k = X^*_{k+1} - X_k = (X_k + X_{k-1} - 2X_{k-2})/3$.


In terms of the increments $Z_k$, the predicted change $\delta_k$ is a weighted average of adjacent values given by $\delta_k = (Z_k + 2Z_{k-1})/3$.  The value of $\delta_k$ indicates both the direction and amount of the predicted change, a positive value indicating a price increase, and a negative value a decrease.  As such it serves as the direction parameter in the system below.
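The closed-form predictor can be checked against a direct least-squares fit.  The sketch below is only an illustration with hypothetical function names; it assumes the three prices are placed at local times 1, 2, 3 and the prediction is read off at time 4, which is the convention under which the intercept formula for B holds.  Exact rational arithmetic is used so the comparison is not blurred by rounding.

```python
from fractions import Fraction


def fit_line(ts, xs):
    """Ordinary least-squares slope A and intercept B for the points (t, x)."""
    ts = [Fraction(t) for t in ts]
    xs = [Fraction(x) for x in xs]
    n = len(ts)
    t_bar = sum(ts) / n
    x_bar = sum(xs) / n
    slope = (sum((t - t_bar) * (x - x_bar) for t, x in zip(ts, xs))
             / sum((t - t_bar) ** 2 for t in ts))
    return slope, x_bar - slope * t_bar


def llsq_predict(x_km2, x_km1, x_k):
    """Closed-form three-point predictor (4 X_k + X_{k-1} - 2 X_{k-2}) / 3."""
    return Fraction(4 * x_k + x_km1 - 2 * x_km2) / 3


def predicted_change(x_km2, x_km1, x_k):
    """Predicted change written in terms of increments: (Z_k + 2 Z_{k-1}) / 3."""
    z_k, z_km1 = x_k - x_km1, x_km1 - x_km2
    return Fraction(z_k + 2 * z_km1) / 3


# Example: prices 10, 11, 13 on days k-2, k-1, k.
A, B = fit_line([1, 2, 3], [10, 11, 13])
assert 4 * A + B == llsq_predict(10, 11, 13)          # fitted line at t = 4
assert llsq_predict(10, 11, 13) - 13 == predicted_change(10, 11, 13)
```

For these three prices the fit gives A = 3/2 and B = 25/3, so the predicted price is 43/3 and the predicted change is 4/3, in agreement with both closed forms.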



The correlation components $\omega_k$ and predicted changes $\delta_k$ may be displayed as ordered pairs $(\delta_k, \omega_k)$ in a ‘phase diagram’ which characterizes the increment series $Z$, and hence the price series $X$ (Figure 1).  Note that $\delta_k$ is a weighted average of $Z_k$ and $Z_{k-1}$ and that, when positive, $\omega_k$ is the square of their geometric mean.  By the inequality between arithmetic and geometric means of real numbers we have $\omega_k \le 9\delta_k^2/8$ at each time $k$ (the bound holds trivially when $\omega_k \le 0$).  Consequently, all points in the diagram are bounded above by a parabolic envelope with vertex at the origin and equation $\omega = 9\delta^2/8$.
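The envelope bound can also be verified directly, without appeal to the arithmetic-geometric mean inequality.  The short derivation below uses only the definitions of the predicted change (one third of $Z_k + 2Z_{k-1}$) and the correlation component (the product $Z_k Z_{k-1}$) given in the text:

```latex
\begin{align*}
9\delta_k^2 - 8\omega_k
  &= (Z_k + 2Z_{k-1})^2 - 8 Z_k Z_{k-1} \\
  &= Z_k^2 - 4 Z_k Z_{k-1} + 4 Z_{k-1}^2 \\
  &= (Z_k - 2Z_{k-1})^2 \;\ge\; 0 .
\end{align*}
```

Equality holds exactly when $Z_k = 2Z_{k-1}$, so a point lies on the parabolic envelope only when the most recent increment is twice the previous one.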

Figure 1. The phase diagram for FVE (NYSE, AC1 = 0.46, 252 trading days, ending 03/07/12).  Points in the upper half-plane indicate trends, while points in the lower half-plane indicate noise or cycles.  Points in the right half-plane indicate predicted price increases, and those in the left half-plane indicate predicted price decreases.


In the diagram, points in the upper half-plane ($\omega_k > 0$) represent locations of micro-trends in $X$; those in the lower half-plane indicate noise and/or cycles.  Points in the first quadrant are those having $\delta_k > 0$, indicating an uptrend, while those in the second quadrant have $\delta_k < 0$, indicating a downtrend.  (Quadrants are numbered 1 through 4, counter-clockwise starting in the upper right.)  Our assumption is now that points in the first two quadrants will indicate the most accurate predictions of price change.  In the analysis below, each of the four quadrants defines a trading system.



Given a price series $X$, a trading system $S$ for $X$ is a time-series $\lambda_k$ taking values 0, +1 and −1 only, where +1 represents a long position, −1 a short position, and 0 indicates no position, on day $k$.  (For prediction purposes, the series $\lambda$ must be causally related to the series $X$; that is, the calculation of $\lambda_k$ may involve only values $X_m$ with $m < k$.)  The trading systems used here are based on the points $P_k = (\delta_k, \omega_k)$ in the phase diagram:


System $S_1$: for each $k$, $\lambda_k = 1$ if $P_{k-1}$ is in the first quadrant, and $\lambda_k = 0$ otherwise

System $S_2$: for each $k$, $\lambda_k = -1$ if $P_{k-1}$ is in the second quadrant, and $\lambda_k = 0$ otherwise

System $S_3$: for each $k$, $\lambda_k = -1$ if $P_{k-1}$ is in the third quadrant, and $\lambda_k = 0$ otherwise

System $S_4$: for each $k$, $\lambda_k = 1$ if $P_{k-1}$ is in the fourth quadrant, and $\lambda_k = 0$ otherwise
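The four systems can be sketched as a single function that maps a price list to a position series.  The names below are hypothetical and this is not the author's code.  Two choices are assumptions of the sketch: points lying exactly on an axis are assigned to no quadrant (the text does not say how they are treated), and positions are zero for the first three days, where the point for day k−1 is not yet defined.

```python
def phase_point(x_km2, x_km1, x_k):
    """Return (delta_k, omega_k) from three consecutive prices."""
    z_prev = x_km1 - x_km2            # Z_{k-1}
    z_curr = x_k - x_km1              # Z_k
    delta = (z_curr + 2 * z_prev) / 3
    omega = z_curr * z_prev
    return delta, omega


def quadrant(delta, omega):
    """Quadrant 1-4, counter-clockwise from the upper right; 0 on an axis."""
    if omega > 0:
        return 1 if delta > 0 else 2 if delta < 0 else 0
    if omega < 0:
        return 3 if delta < 0 else 4 if delta > 0 else 0
    return 0


# Position taken by each system when the previous point is in its quadrant.
POSITION = {1: +1, 2: -1, 3: -1, 4: +1}


def positions(prices, system):
    """Position series lambda_k (same length as prices) for system S_n."""
    lam = [0] * len(prices)
    for k in range(3, len(prices)):
        # P_{k-1} depends only on X_{k-1}, X_{k-2}, X_{k-3}, so lambda_k is causal.
        delta, omega = phase_point(prices[k - 3], prices[k - 2], prices[k - 1])
        if quadrant(delta, omega) == system:
            lam[k] = POSITION[system]
    return lam
```

For the illustrative price list 10, 11, 13, 12, 10, 11, system 1 takes a long position on day 3 (counting from day 0), system 4 a long position on day 4, and system 2 a short position on day 5.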


To evaluate a trading system we use the return series $r_k = X_k/X_{k-1} - 1$.  This series represents the fractional price change between day $k-1$ and day $k$.  A trade is deemed successful if the product


$\pi_{k+1} = r_{k+1} \lambda_k$


is positive, indicating that $r_{k+1}$ and $\lambda_k$ have the same sign, and hence a correct prediction of the direction of price change.  In most systems, many of the $\lambda_k$ will be zero.  The hit-ratio for system $S_n$ and stock $X$ is defined as the ratio of the number of positive $\pi_k$ to the number of non-zero $\pi_k$ at the end of the simulation, and is denoted by $H_n = H(S_n, X)$.  It provides a measure of how well the system $S_n$ predicts the direction of price changes for stock $X$ (hit-ratios are reported in percent from zero to 100).  Assuming random predictions would result in 50% accuracy, relative-hit-ratios are calculated as $H_n - 50\%$.
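A minimal sketch of the hit-ratio calculation follows, with hypothetical function names.  It takes a price list and a position list of equal length and follows the definition in the text literally, so a day on which a position is held but the return is exactly zero produces a zero product and is not counted.  Returning `None` when no trade occurs is an assumption of the sketch.

```python
def trade_products(prices, lam):
    """Products pi_{k+1} = r_{k+1} * lambda_k for k = 0 .. N-2."""
    return [(prices[k + 1] / prices[k] - 1) * lam[k]
            for k in range(len(prices) - 1)]


def hit_ratio(prices, lam):
    """Percentage of non-zero products that are positive, or None if no trades."""
    traded = [p for p in trade_products(prices, lam) if p != 0]
    if not traded:
        return None
    hits = sum(1 for p in traded if p > 0)
    return 100.0 * hits / len(traded)


def relative_hit_ratio(prices, lam):
    """Hit-ratio minus the 50% expected from random predictions."""
    h = hit_ratio(prices, lam)
    return None if h is None else h - 50.0
```

With prices 10, 11, 10, 12 and positions +1, −1, −1, 0, the first two trades are correct and the third is not, giving a hit-ratio of about 66.7 and a relative-hit-ratio of about 16.7.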



In addition to the hit-ratio, we also consider the amount of profit or loss realized if each position indicated by $\lambda_k$ is actually opened on day $k$ at price $X_k$ and closed the following day at price $X_{k+1}$.  By considering only transactions of exactly one trading unit, $r_k$ becomes the actual profit/loss available on day $k$, and $\pi_k$ then represents the profit/loss achieved by the system on day $k$.  Summing the $\pi_k$ over all days gives the accumulated profit/loss from the simulation, or the system-return


$R_n = R(S_n, X) = \sum_k \pi_k$.


In a perfect trading system, where all predicted directions are correct, the maximum profit available from the stock $X$ is the sum of absolute values of the $r_k$, known as the available profit (AP) in $X$ and denoted


$A(X) = \sum_k |r_k|$.


Like the AC1, the AP is a property of the stock, not the trading system.  Normalizing the system-return to remove the effect of the AP gives the efficiency of the system $S_n$ for stock $X$:


$E_n = E(S_n, X) = R(S_n, X)/A(X)$.


The efficiency provides a measure of the relative performance of a given system on stocks with different AC1’s (efficiencies are reported in percent).  The quantity $A(X)/N$ is sometimes referred to as the volatility of the returns, which is essentially what the trading system is “trading”.
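The system-return, available profit and efficiency can be sketched together, again with hypothetical function names and with prices and positions given as equal-length lists.  Pairing the position on day k with the return from day k to day k+1 follows the product defined earlier; raising an error when the available profit is zero is an assumption of the sketch.

```python
def daily_returns(prices):
    """Returns r_1 .. r_{N-1}, where r_k = X_k / X_{k-1} - 1."""
    return [b / a - 1 for a, b in zip(prices, prices[1:])]


def system_return(prices, lam):
    """R = sum of pi_{k+1} = r_{k+1} * lambda_k over all days."""
    return sum(r * l for r, l in zip(daily_returns(prices), lam))


def available_profit(prices):
    """A(X) = sum of |r_k|: the return of a system that is always right."""
    return sum(abs(r) for r in daily_returns(prices))


def efficiency(prices, lam):
    """E = R / A(X), reported in percent."""
    ap = available_profit(prices)
    if ap == 0:
        raise ValueError("flat price series: efficiency is undefined")
    return 100.0 * system_return(prices, lam) / ap
```

For prices 100, 110, 99, going long on the first day and short on the second captures both moves and yields an efficiency of 100%, while trading only the first day yields roughly 50%.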



A database of 4662 stocks from the NYSE (1810), AMEX (400) and NASDAQ (2452), covering one year (252 data points) ending 07 March 2012, was compiled for this study.  (All data was obtained from Yahoo Finance.)  The database was pre-screened to eliminate stocks having zero volume on the last day.  Daily data values consist of the four basic recorded prices: the open, the high and low of the day, and the close.  Previous work [7] had shown that the series of averages of the open and close prices (OC-average) often has a much larger AC1 value than either the open or close series alone.  For this reason, the OC-average series for each stock was used in the simulations.


Stocks in each of the three markets were submitted to the four systems described above.  That is, for each stock $X$ and system $S_n$, the relative-hit-ratio $H_n - 50\%$ and the efficiency $E_n$ were calculated.  Figure 2 shows the average relative-hit-ratios and efficiencies for each of the four systems $S_n$ as applied to all stocks in the three markets.  In all cases, systems $S_1$ and $S_2$ achieved much higher values than systems $S_3$ and $S_4$, supporting our basic assumption.


Also calculated for each stock are the AC1, the trend-point hit-ratio $H_{12} = (H_1 + H_2)/2$ and the efficiency $E_{12} = (E_1 + E_2)/2$ from system $S_{12}$ (systems $S_1$ and $S_2$ averaged).  Figure 3 shows $H_{12}$ and $E_{12}$ plotted against AC1 for each of the three markets.  A linear relationship is assumed in each case and the resulting least-squares regression line is shown, along with its equation $y = px + q$ and coefficient of determination $R^2$.  In the equation for hit-ratios, for example, $y$ is $H_{12} - 50\%$, $x$ is AC1, $p$ is the slope and $q$ the intercept of the line.  Here the AC1 is considered a pure number and the slope and intercept have units of percent.  Equation coefficients for relative-hit-ratios and efficiencies are summarized in Table 1.
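The regression step can be sketched as an ordinary least-squares fit of y = px + q.  The function name is hypothetical.  The statistic returned alongside the coefficients is the coefficient of determination; that this is the R2 printed on the plots is an assumption.  No study data are reproduced here: the example uses points placed exactly on a line purely to show the call.

```python
def fit_regression(xs, ys):
    """Least-squares slope p, intercept q and R^2 for the line y = p*x + q."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are equal: slope is undefined")
    p = sxy / sxx
    q = y_bar - p * x_bar
    ss_res = sum((y - (p * x + q)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    # R^2 = 1 - (residual sum of squares) / (total sum of squares).
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return p, q, r2


# Synthetic points on the line y = 2x + 1 (illustration only).
p, q, r2 = fit_regression([0, 1, 2, 3], [1, 3, 5, 7])
```

In the setting of the text, xs would hold the AC1 of each stock in a market and ys the corresponding relative-hit-ratio or efficiency in percent, so p and q carry units of percent.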

























Table 1. Least-squares coefficients for the relative-hit-ratio and efficiency equations in Figure 3.


For example, an increase of one unit of AC1 results, on average, in an increase of 37 percentage points in the hit-ratio, and 27.5 percentage points in efficiency, for NASDAQ stocks under system $S_{12}$.  Positive slope values in all cases support the assumption that larger AC1 values result in better predictions.



[1] L. Bachelier, Théorie de la Spéculation, Annales Scientifiques de l'École Normale Supérieure, 1900.

[2] R. S. Tsay, Analysis of Financial Time Series, Wiley-Interscience, New Jersey, 2005.

[3] R. L. Brown, J. Durbin, J. M. Evans, Techniques for Testing the Constancy of Regression Relationships over Time, Journal of the Royal Statistical Society, Series B, Vol. 37 (1975), 149-192.

[4] B. Mandelbrot, R. Hudson, The (Mis)Behavior of Markets, Basic Books, New York, 2004.

[5] R. F. Engle, Autoregressive Conditional Heteroscedasticity with Estimates of the Variance of United Kingdom Inflation, Econometrica, Vol. 50, No. 4 (1982), 987-1007.

[6] T. Bollerslev, Generalized Autoregressive Conditional Heteroskedasticity, Journal of Econometrics, Vol. 31, No. 3 (1986), 307-327.

[7] R. Martinelli, Induced Correlation in Stock Market Data, Dec. 2011.





Figure 2. Relative hit-ratios and efficiencies for each of the four trading systems $S_1$, $S_2$, $S_3$, $S_4$, as applied to all stocks in the three markets.  Panels, one row per market (AMEX, NYSE, NASDAQ): relative-hit-ratios vs trading system (left) and efficiencies vs trading system (right).





Figure 3. Relative-hit-ratios for trend points ($H_{12} - 50\%$, left plots, green) and efficiencies $E_{12}$ (right plots) are plotted versus AC1 for stocks in the three markets.  Least-squares lines are shown, and their equations and coefficients of determination $R^2$ are given at the top of each plot.  Panels, one row per market: AMEX, NYSE, NASDAQ.