The Twitter Predictor of the Dow: The Rise and Fall

The story of the Twitter predictor of the Dow, as devised by Dr. Johan Bollen, reads like a classic rise-and-fall tale, much like that of the Roman Empire.

Dr. Bollen, Associate Professor of Informatics and Computing at Indiana University, published a seminal article in 2010 claiming that moods derived from Twitter feeds can predict the Dow Jones Industrial Average.

Bollen and two co-authors claim in the paper that mood indicators derived from millions of tweets can predict the Dow. They use two mood-tracking tools (Google Profile of Mood States, GPOMS, and an alternative tool, OpinionFinder) that measure mood as expressed in those tweets. The resulting mood indicators are then correlated with daily changes in the Dow’s closing value.

Twitter mood extractors

The mood-extracting tools work a bit like a black box, but basically it boils down to sifting through the enormous daily Twitter stream and capturing mood-indicating tweets that contain phrases such as ‘I feel’, ‘I am feeling’ or ‘makes me’. The words in these tweets are matched against ‘lexicons’ of terms that are associated with certain moods. This yields daily mood indicators.
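The filtering and lexicon matching described above can be sketched roughly as follows. This is only an illustration: the trigger phrases come from the text, but the tiny lexicon, the function name daily_mood and the sample tweets are assumptions made up for this example and are not the actual GPOMS or OpinionFinder lexicons.

```python
import re
from collections import defaultdict

# Trigger phrases taken from the examples in the text.
TRIGGERS = re.compile(r"\b(i feel|i am feeling|makes me)\b", re.IGNORECASE)

# Toy lexicon (an assumption for illustration): word -> mood dimension.
LEXICON = {
    "relaxed": "calm", "peaceful": "calm",
    "glad": "happy", "joyful": "happy",
    "certain": "sure", "confident": "sure",
}

def daily_mood(tweets):
    """tweets: iterable of (date_string, text).

    Returns {date: {mood: share of that day's mood-indicating tweets}}.
    """
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for day, text in tweets:
        if not TRIGGERS.search(text):
            continue  # keep only mood-indicating tweets
        totals[day] += 1
        words = set(re.findall(r"[a-z']+", text.lower()))
        # Count each mood at most once per tweet.
        for mood in {LEXICON[w] for w in words if w in LEXICON}:
            counts[day][mood] += 1
    return {day: {m: c / totals[day] for m, c in counts[day].items()}
            for day in totals}

# Made-up sample tweets, for illustration only.
sample = [
    ("2008-10-01", "I feel relaxed and peaceful today"),
    ("2008-10-01", "This news makes me glad"),
    ("2008-10-01", "Stocks fell again"),
    ("2008-10-02", "I am feeling confident"),
]
print(daily_mood(sample))
```

The tweet without a trigger phrase is ignored, and each remaining tweet contributes at most once to each mood dimension on its day.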

Whereas OpinionFinder uses only one mood indicator (positive versus negative), GPOMS has 6 dimensions: Calm, Alert, Sure, Vital, Kind and Happy.

The core of the article is Table II, which presents the correlation results of the daily mood indicators with the daily change of the Dow for the period from February 28, 2008 to November 3, 2008.

Correlation results of the daily change of the Dow and lagged values of the mood indicators. Source: Bollen, J., Mao, H., Zeng, X., Twitter mood predicts the stock market, 2010, http://arxiv.org/pdf/1010.3003.pdf

Based on these p-values, Bollen et al. conclude that the Calm mood dimension “has predictive value with regards to the DJIA”. Apparently, changes in the Calm mood indicator predict changes in the Dow a few days later.
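To make the idea of a lagged relationship concrete, here is a minimal sketch that correlates a mood series on day t - lag with the Dow change on day t. It is not the test used in the paper, whose p-values come from a formal statistical procedure; the data are synthetic, and the function names and the built-in three-day lag are assumptions chosen purely for illustration.

```python
import math
import random

def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def lagged_correlation(mood, dow_change, lag):
    """Correlate mood on day t - lag with the Dow change on day t."""
    if lag == 0:
        return pearson(mood, dow_change)
    return pearson(mood[:-lag], dow_change[lag:])

# Synthetic data (an assumption): the Dow change follows the mood
# of three days earlier, plus noise.
rng = random.Random(0)
mood = [rng.gauss(0, 1) for _ in range(200)]
dow = [0.0] * 3 + [0.8 * m for m in mood[:-3]]
dow = [d + rng.gauss(0, 0.5) for d in dow]

for lag in range(1, 8):
    print(lag, round(lagged_correlation(mood, dow, lag), 2))
```

In this artificial setup only the three-day lag should stand out, because that is the lag that was built into the data.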

The authors proceed by analyzing the same correlation with the help of a neural network. They apply the model to predict the direction of the Dow change during a short test period (December 1 to December 19, 2008; 15 business days) and show that their model was accurate for 13 out of 15 business days, i.e. an accuracy rate of 13/15 = 86.7%.

The Rise…

Of course, the Twitter prediction of the Dow did not go unnoticed. Dr. Bollen ended up explaining his Twitter predictions to numerous newspapers and TV channels in 2012 and 2013, among them Fox, CNBC and Bloomberg. The 86.7% accuracy claim in particular was widely picked up in the financial news.

The article ranks high in terms of citations (694 citations according to Google Scholar, accessed April 15, 2014).

In February 2011 Derwent Capital Markets launched a hedge fund that used the Twitter predictor to guide its investment decisions.

… And decline: datamining bias

In April 2012 Dr. Bollen’s reasoning was attacked by an anonymous blogger who dismissed the research. We will not repeat all the arguments here, but focus on three criticisms, starting with the so-called ‘datamining bias’.

The anonymous blogger took a closer look at the reported p-values (see also the table above) and concluded that the research tests not one hypothesis but 49 (7 variables × 7 lags each).
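A quick back-of-the-envelope calculation shows why 49 tests matter. The sketch below assumes, for simplicity, that the 49 tests are independent and that every null hypothesis is true. Neither assumption comes from the paper; they only serve to show the order of magnitude.

```python
def prob_at_least_one_false_positive(n_tests, alpha):
    """Chance of at least one 'significant' result when all nulls are true,
    assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests

n_tests = 7 * 7  # 7 variables times 7 lags

for alpha in (0.10, 0.05):
    expected = n_tests * alpha  # expected number of false positives
    p_any = prob_at_least_one_false_positive(n_tests, alpha)
    print(f"alpha={alpha:.2f}: expected false positives={expected:.2f}, "
          f"P(at least one)={p_any:.3f}")
```

Under these assumptions one would expect about 2.45 results below the 5% level by chance alone, and the probability of seeing at least one is roughly 92%.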

In my opinion the anonymous blogger convincingly points out that finding a few significant results (p below 10%, p below 5%) is not that remarkable after testing many hypotheses. The blogger collects all p-values and plots them as an empirical cumulative distribution function, reproduced below. The data points all lie close to the 45-degree line, meaning that, for example, about 20% of the p-values are below 20%, and the same holds for any percentage between 0% and 100%. This is what one would expect if the null hypothesis were true, because p-values are then uniformly distributed.

Empirical CDF of Bollen’s p-values, re-engineered in MATLAB after http://sellthenews.tumblr.com/post/21067996377/noitdoesnot

The anonymous blogger correctly concludes that this does not look like a significant finding. A significant finding would show a curve that initially rises steeply, well above the red diagonal line. As the anonymous blogger puts it:

If the null were violated, i.e. if the Twitter mood data exhibited ‘causality’ on the DJIA movement, we should see a lot of p-values on the left side of the plot, and the empirical CDF would hug the left side and top of the plot, bowing away from the diagonal.
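The difference between the two shapes can be illustrated with simulated p-values. The sketch below does not use Bollen’s actual p-values: it draws 49 uniform p-values to mimic the null case and, as a made-up contrast, 49 p-values squeezed towards zero. The function names and the choice of the ‘signal’ distribution are assumptions for illustration.

```python
import random

def ecdf(pvals, x):
    """Empirical CDF: share of p-values that are <= x."""
    return sum(p <= x for p in pvals) / len(pvals)

def max_gap_from_diagonal(pvals):
    """Largest vertical distance between the empirical CDF and the
    45-degree line (the Kolmogorov-Smirnov distance to the uniform)."""
    s = sorted(pvals)
    n = len(s)
    return max(max((i + 1) / n - p, p - i / n) for i, p in enumerate(s))

rng = random.Random(42)
n_tests = 49

# Null case: p-values are uniform on [0, 1].
null_pvals = [rng.random() for _ in range(n_tests)]
# Made-up 'signal' case: p-values concentrated near zero.
signal_pvals = [rng.random() ** 3 for _ in range(n_tests)]

for name, pv in (("null", null_pvals), ("signal", signal_pvals)):
    print(name,
          "share below 0.20:", round(ecdf(pv, 0.20), 2),
          "max gap:", round(max_gap_from_diagonal(pv), 2))
```

In the null case the share of p-values below 0.20 stays near 20% and the gap to the diagonal is small; in the contrast case the curve bows away from the diagonal, which is the pattern the blogger says is missing.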

The blogger goes on to state that a Bonferroni correction needs to be applied: when testing n hypotheses at an overall significance level alpha, each individual test should use alpha / n as its significance level.

In an update, the anonymous blogger presents a discussion about the necessity of the Bonferroni correction. In this discussion a dissenting opinion holds that Bonferroni corrections can be too conservative, increasing the probability of a type II error (erroneously failing to reject the null hypothesis), and that they discourage researchers from reporting the full range of hypotheses they tested.

Although the Bonferroni correction may seem to be too conservative, some kind of correction is necessary if testing is data-driven rather than hypothesis-driven.
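As a minimal sketch of what such a correction looks like, the code below applies the Bonferroni rule and, for comparison, the Holm step-down procedure. Holm is not mentioned in the blogger’s post or in the paper; it is included here only as one well-known example of a less conservative correction. The four p-values are made up for illustration.

```python
def bonferroni_reject(pvals, alpha=0.05):
    """Reject a hypothesis only if its p-value is <= alpha / n."""
    n = len(pvals)
    return [p <= alpha / n for p in pvals]

def holm_reject(pvals, alpha=0.05):
    """Holm step-down: compare the k-th smallest p-value (k = 0, 1, ...)
    with alpha / (n - k) and stop at the first failure."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    reject = [False] * n
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (n - rank):
            reject[i] = True
        else:
            break
    return reject

# Made-up p-values, for illustration only.
pvals = [0.001, 0.012, 0.02, 0.3]
print("Bonferroni:", bonferroni_reject(pvals))
print("Holm:      ", holm_reject(pvals))

# With 49 tests at alpha = 0.05, the Bonferroni threshold per test:
print("Threshold for 49 tests:", round(0.05 / 49, 5))
```

With these made-up numbers Holm rejects one hypothesis more than Bonferroni, which illustrates the point that Bonferroni is the stricter of the two. For 49 tests the Bonferroni threshold per test is about 0.001.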

The problem of the one-day skipping impact

The anonymous blogger also points out that there is no sound explanation for the fact that the Calm indicator is, according to the original article (so without the Bonferroni correction), significant for lags of 2 to 6 days, but not at the 1-day lag. The blogger concludes (again convincingly) that:

Somehow, we are to believe, the information content ‘skips a day’ (or more). This is contrary to common sense, and common practice of downweighting older observations as less relevant. It is particularly hard to imagine how using two- or three-day old tweets would give one the best market timing model of all time.

In December 2011 Bollen’s research group published new research in which a Twitter indicator for bullish sentiment was used as a predictor for the Dow. It shows the same confusing signs of the regression coefficients for the lags (in that case: a positive sign for lag 1, a negative sign for lag 2) and the same confusing pattern of significance (the 1st, 2nd and 5th coefficients are significant at the 5% level).

The difficulty is that there is no proper explanation of the causal mechanism by which mood indicators would lead to changes in the Dow. In the CNBC interview Dr. Bollen gives it a try and explains that it is “perfectly possible” that there are a lot of investors on Twitter and that they pick up the mood on Twitter “and then that gets translated somehow – perhaps even unconsciously – into their investment strategies.”

Accuracy claim based on a very small time window

Thirdly, the anonymous blogger draws attention to the fact that the 86.7% accuracy claim is based on only 15 days of test data (December 1 to December 19, 2008; 15 business days). I agree that the 86.7% accuracy is a bold claim if it is based solely on a 15-day window chosen by the researchers themselves.
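To get a feeling for how little 15 observations pin down an accuracy rate, one can compute a confidence interval around 13 out of 15. The sketch below uses the Wilson score interval, which is my choice for this illustration and does not appear in the paper or in the blogger’s post. It also assumes that the 15 days are independent trials, which is itself questionable for daily market data.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion
    (z = 1.96 corresponds to roughly 95% confidence)."""
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n
                         + z * z / (4 * n * n)) / denom
    return center - half, center + half

low, high = wilson_interval(13, 15)
print(f"13/15 = {13 / 15:.3f}, approx. 95% interval: {low:.3f} to {high:.3f}")
```

Under these assumptions the interval runs from roughly 62% to 96%, so the headline figure of 86.7% is compatible with a much more modest true accuracy.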

Meanwhile, the Twitter fund did not live up to its promises and had to shut down, as reported by the Financial Times in May 2012 (‘Last Tweet for Derwent’s Absolute Return’). The question is what bearing this fact has on the validity of the research. In the blogger’s update, a dissenting opinion holds that this ‘real-world experiment’ has no bearing on the validity of the scientific result:

This type of argument from a supposed business application, or “real-world experiment”, is [neither] appropriate nor relevant in a discussion about the validity of a scientific result. Should all papers investigating the relative effectiveness of tubular solar panels be retracted because Solyndra went bankrupt?

Wisdom of the crowd?

My view on this? I think that the power to predict the stock market with the help of Twitter microblogs is a far-reaching claim. The fact that a hedge fund set up to capitalize on this very claim was liquidated shortly after its inception does not help to support the claim that Twitter can predict the Dow.

Conversely, imagine that the fund had succeeded; in that case I think its success would have been used as corroboration of the original claim. Why, then, shouldn’t its liquidation be interpreted as a lack of support for the claim that Twitter mood predicts the Dow and can give investors an edge?

I also think that the model specification is much too simple to squeeze structural arbitrage opportunities out of the market. Why would your private expressions starting with ‘I feel …’ be predictive of the Dow? So, yes, in this regard I am skeptical of the wisdom of the crowd.

References

Bollen, J., Mao, H., Zeng, X., Twitter mood predicts the stock market, 2010, http://arxiv.org/pdf/1010.3003.pdf

Mao, H., Counts, S., Bollen, J., Predicting Financial Markets: Comparing Survey, News, Twitter and Search Engine Data, 2011, http://arxiv.org/pdf/1112.1051.pdf

About Folpmers
Financial Risk Management consultant, manager of an FRM consulting department, endowed professor of FRM
