
Problems with smart beta – part 8: Backtesting

This is the eighth part of a series discussing some of the problems associated with investing in “smart beta” strategies. For the previous post on how factor returns from academia don’t always translate into actual returns, click here.

Would you entrust your investments to a group of monkeys? Probably not.

What if you were told that the monkeys had been trained to use a proprietary quantitative trading strategy for stock selection? Still, probably not.

How about if the monkeys’ strategy had been shown to beat the market without fail every year for the last 40 years, producing top decile returns and outperforming all their peers? Those returns are pretty impressive, maybe the monkeys are on to something.

But my guess is that most of us would think twice before investing with them, because we’d know it was monkeys picking the stocks, and therefore there must be some funny business going on.

But what if we flip the example, and instead we were told it was humans picking the stocks? They use the same process and achieve the same returns. Would we be so sceptical of the returns then? Probably not. In fact, I’d imagine most people would be sorely tempted to invest.

What is data mining?


In short, data mining is the problem of finding relationships between variables in historical data where, in reality, none exist. There’s an old quote to explain data mining that goes, “If you torture the data long enough, it’ll confess to anything.” The idea being that historic data contains so much information that if you analyse it long enough, you’ll be able to find any result you want.

For example, if we were trying to find a variable to predict the movements of the MSCI World, and we found one that looked like the graph below, we’d be pretty pleased. The red line is the average annual value of the MSCI World:

MSCI World data mining example 1

Source: Political Calculations

In statistical speak, the two variables have an R² value of 0.96, which means that 96% of the movement in the MSCI World is explained by the movement in the blue line – i.e. they are very highly correlated.

But not only do they have a very high correlation, it looks like whenever the MSCI World overtakes the blue line, a crash soon follows. The model correctly predicted the crash in the 1980s, the dot-com crash, and the 2008 crash – most people would kill for a model with such predictive power. But should we be using this to time the stock market and save ourselves from market crashes?

Whilst the red line does represent the MSCI World, the blue is in fact the average live weight of farm-raised turkeys in the United States:

MSCI World data mining example 2

Source: Political Calculations

Now that we know what the variables are, the correlation appears to be less convincing. Correlation is not causation.

Despite the data finding a very strong relationship between turkey weight and the MSCI World, intuitively we know that the two are completely unrelated. We would never base our entire investment strategy on turkeys, because we know that despite what the data says, there is no relationship between the two.

This is the data mining problem – finding relationships in data where none exist. It is also why having a strong economic rationale for why two variables should be related is so important when trying to prove a relationship.
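To see how little it takes to produce a chart like the turkey one, here is a minimal Python sketch. It does not use the real MSCI World or turkey data: both series are made up, and the names `trending_series`, `index_like` and `turkey_like` are purely illustrative. The only assumption is that each series drifts upwards over 40 years with some random noise, independently of the other.

```python
import random


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov ** 2 / (var_x * var_y)


def trending_series(rng, years=40, drift=1.0, noise=0.5):
    """Made-up series: rises by `drift` a year on average, plus random noise."""
    level, series = 100.0, []
    for _ in range(years):
        level += drift + rng.gauss(0, noise)
        series.append(level)
    return series


def yearly_changes(series):
    """Year-on-year differences, which strip out the shared upward trend."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


rng = random.Random(0)  # fixed seed so the run is repeatable
index_like = trending_series(rng)   # stands in for a stock index
turkey_like = trending_series(rng)  # stands in for turkey weights

r2_levels = r_squared(index_like, turkey_like)
r2_changes = r_squared(yearly_changes(index_like), yearly_changes(turkey_like))

print(f"R-squared of levels:         {r2_levels:.2f}")
print(f"R-squared of yearly changes: {r2_changes:.2f}")
```

The two series share nothing except the fact that they both go up over time, yet the R² of their levels comes out very high. Compare the year-on-year changes instead and the relationship all but disappears, which is a useful sanity check whenever two trending lines appear to move together.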

If we look at any data set long enough, we can find all sorts of so-called “spurious correlations”. A great resource for more data-mining examples is the source for the following graph, which contains dozens of spurious correlations born from data-mining:

Spurious correlations example

Source: Spurious Correlations

How does it relate to investing?


Given the huge amount of data on all the global stock markets, data mining can be a real issue for investors.

Before we have a look at a few examples, there are a few definitions that we need to understand. Statisticians, just like those of us working in investments, seem to enjoy making their field as impenetrable as possible by flinging technical jargon at the reader with reckless abandon.


In-sample vs out-of-sample

Statisticians use the terms “in-sample” and “out-of-sample” to describe what data the forecasting model is being tested on. The “sample” is the data sample that is being used to create the model.

When constructing a model, firstly you start with a sample from which to construct your model. Secondly, you create a model based on the data in the sample. And thirdly, you use the model for forecasting.

If you are forecasting for an observation that was part of the data sample – it’s an in-sample forecast. If you are forecasting for an observation that was not part of the data sample – it’s an out-of-sample forecast.

For example, if you use data for 1990-2013 to create the model, and then you forecast for 2011-2013, it’s an in-sample forecast. But if you use the same data for fitting the model and then you forecast 2014-2017, then it’s an out-of-sample forecast.
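The 1990-2013 example can be sketched in a few lines of Python. Everything here is hypothetical: the `returns` figures are invented, the helper names `split_sample` and `forecast_type` are mine, and the "model" is nothing more than the average in-sample return. The point is only to show where the line between in-sample and out-of-sample falls.

```python
def split_sample(observations, first_year, last_year):
    """Split {year: value} data into the fitting sample and everything else."""
    in_sample = {y: v for y, v in observations.items()
                 if first_year <= y <= last_year}
    out_of_sample = {y: v for y, v in observations.items()
                     if not (first_year <= y <= last_year)}
    return in_sample, out_of_sample


def forecast_type(year, in_sample):
    """Label a forecast by whether its year was used to build the model."""
    return "in-sample" if year in in_sample else "out-of-sample"


# Hypothetical annual returns, invented purely for illustration.
returns = {year: 0.05 + 0.01 * ((year * 7) % 5 - 2) for year in range(1990, 2018)}

# Build the model on 1990-2013 only.
in_sample, out_of_sample = split_sample(returns, 1990, 2013)

# The "model" is deliberately trivial: the average in-sample return.
model_forecast = sum(in_sample.values()) / len(in_sample)

for year in (2012, 2015):
    print(year, forecast_type(year, in_sample),
          f"forecast={model_forecast:.3f}", f"actual={returns[year]:.3f}")
```

A forecast for 2012 is in-sample because the model has already seen that year. A forecast for 2015 is out-of-sample, and it is the only one of the two that tells us anything about how the model copes with data it has never met.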


Returning to the real world, a good way to think about how data mining can be used to create spectacular investing strategies is by looking at our stock-selecting monkeys.

I can now reveal that their “proprietary quantitative trading strategy” was in fact thousands of monkeys throwing darts at the stock pages of the Wall Street Journal.

Some monkeys will have great performance, and selecting a combination of these monkeys in-sample will yield spectacular performance. Of course, this group will not perform out of sample.

This example, courtesy of Wes Gray from Alpha Architect, is a fantastic way to think about data mining. It shows how easy it can be to use data mining to find a strategy that looks great in-sample, but will obviously perform terribly out-of-sample.

The problem is that a data-mined strategy can look extremely convincing to a potential investor. Consider the in-sample performance of the stock-picking monkeys – if they had been humans, their returns would have looked incredible on paper, and many investors would be tempted to invest with them based on their fantastic track record. But the returns were in fact nothing more than picking the best combination of returns in hindsight from thousands of possible combinations.
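The monkey experiment is easy to simulate. The sketch below is an illustration under stated assumptions, not a reproduction of Wes Gray's example: it assumes 1,000 monkeys, 20 in-sample years and 20 out-of-sample years, and that each monkey's yearly return relative to the market is pure noise with a true average of zero and a standard deviation of 15%. All of these numbers are made up.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable
N_MONKEYS, IN_YEARS, OUT_YEARS, TOP = 1000, 20, 20, 10


def random_track_record(years):
    """Yearly returns relative to the market for one dart-throwing monkey.

    Assumed: no skill at all, so the true average is zero.
    """
    return [rng.gauss(0.0, 0.15) for _ in range(years)]


def average(values):
    return sum(values) / len(values)


# Each monkey gets an in-sample record and a separate out-of-sample record.
monkeys = [(random_track_record(IN_YEARS), random_track_record(OUT_YEARS))
           for _ in range(N_MONKEYS)]

# Pick the best monkeys with hindsight, using in-sample results only.
best = sorted(monkeys, key=lambda m: average(m[0]), reverse=True)[:TOP]

in_sample_avg = average([average(m[0]) for m in best])
out_of_sample_avg = average([average(m[1]) for m in best])

print(f"Top {TOP} monkeys, in-sample:     {in_sample_avg:+.1%} a year vs the market")
print(f"Same monkeys, out-of-sample: {out_of_sample_avg:+.1%} a year vs the market")
```

The hand-picked monkeys look like stars in-sample because they were chosen for exactly that reason. Out-of-sample, their average drifts back towards zero, which is all a skill-free strategy can be expected to deliver.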

Similarly, Jason Zweig showed in a Wall Street Journal article that you could have beaten the market over the last 20 years by buying companies in the US whose ZIP code digits add up to 21. Or by using a strategy that buys companies whose ticker symbols are derived exclusively from letters in either the word “Republican” (Rubicon Technology, or RBCN) or “Democrat” (Caterpillar, or CAT). The portfolio holds only Republican-tickered stocks in years when the GOP holds the presidency. When a Democrat is in the White House, the strategy holds only stocks with “Democrat” tickers.

Both of these strategies looked great in backtests, but you would never expect them to perform in the real world.

In a recent Bloomberg interview with enigmatic hedge fund manager Cliff Asness of AQR, Cliff was asked specifically whether he was worried that his quantitative strategies were the result of data mining. His response was:

I’m always in a panic that they’re the result of data mining or survivorship bias. We spend our lives trying to disprove that. I’ll still wake up sweating once a year, worrying that we’ve just gotten lucky forever.

If he’s worried that his strategies could be nothing more than a product of data-mining, then we should be worried about ours too.

Trial and error


One of the reasons why data mining (also known as “backtest overfitting”) is so prominent in finance is explored in this paper in the Notices of the American Mathematical Society. Intuitively, we know that the more configurations someone tries when testing an investing strategy, the more likely it is that the results are the product of data mining. They’ll simply keep trying configurations until one works. The authors show that because most financial analysts and academics rarely report the number of configurations tried for a given backtest, investors cannot evaluate the degree of overfitting in most investment proposals. Therefore, the more trials a financial analyst executes, the greater should be the in-sample Sharpe ratio demanded by the potential investor.
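To put rough numbers on that last point, the sketch below estimates the best in-sample Sharpe ratio we should expect to see purely by luck after a given number of trials. It is a back-of-the-envelope illustration rather than the paper's own calculation. It assumes every configuration has a true Sharpe ratio of zero, that the trials are independent, and that an annualised Sharpe ratio estimated over a given number of years is roughly normal with a standard deviation of one over the square root of the years. The expected maximum uses an extreme-value approximation, and the function names are mine.

```python
from math import e, sqrt
from statistics import NormalDist

EULER_MASCHERONI = 0.5772156649


def expected_max_of_normals(n_trials):
    """Approximate expected maximum of n_trials independent standard normal draws."""
    if n_trials < 2:
        raise ValueError("need at least two trials")
    z = NormalDist().inv_cdf
    return ((1 - EULER_MASCHERONI) * z(1 - 1 / n_trials)
            + EULER_MASCHERONI * z(1 - 1 / (n_trials * e)))


def expected_best_sharpe(n_trials, years):
    """Expected best in-sample annualised Sharpe ratio across n_trials
    configurations when every one of them has a true Sharpe ratio of zero."""
    return expected_max_of_normals(n_trials) / sqrt(years)


for trials in (10, 100, 1000):
    print(f"{trials:>5} trials over 5 years: "
          f"best Sharpe by luck ~ {expected_best_sharpe(trials, 5):.2f}")
```

The figure climbs steadily as the number of trials grows, even though no configuration has any skill. A backtested Sharpe ratio therefore means very little unless we also know how many configurations were tried to find it.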

The authors “strongly suspect” that such backtest overfitting is a large part of the reason why so many algorithmic or systematic hedge funds don’t live up to the elevated expectations generated by their managers. They try as many strategies as possible until they find one that works, without

How does data mining relate to factor investing?


Data mining has become an increasing problem in financial literature. There have been over 400 ‘factors’ that researchers have claimed to be statistically significant in papers published in academic journals. The conflict in financial academia is that researchers are incentivised to publish statistically significant results. This naturally leads to strategies being data mined to increase the likelihood of them being published. So how many factors are actually robust, and how many are the result of data mining?

Factor strategies and data mining

Source: Chicago Booth Review

In perhaps the most extensive review of factors to date, Kewei Hou, Chen Xue, and Lu Zhang authored a paper named ‘Replicating Anomalies’ in 2017, reviewing 447 market “anomalies” (factors) to see if they could replicate the factors’ results under research conditions. The paper took the authors 3 years to write, and their ultimate conclusion was that 286 of the previously found anomalies (64%) failed to pass their tests at the conventional 5% significance level, rising to 85% under a stricter cutoff, and of the 161 remaining significant anomalies, their magnitudes are often much lower than originally reported. Of the significant anomalies, the usual suspects are the ones that survive, including momentum, value, investment, and profitability.

Speaking about the authors of financial research, they note in their conclusion that “authors often engage in p-hacking [a form of data-mining], i.e., selecting sample criteria and test procedures until insignificant results become significant. The likely outcome is an embarrassingly large number of false positives that cannot be replicated in the future… The evidence indicates p-hacking that authors search for specifications that deliver just-significant results and ignore those that give just-insignificant results to make their work more publishable… As such, most published anomaly profits are greatly exaggerated.”

These findings are echoed by researchers from Research Affiliates, who note in their paper ‘Alice’s Adventures in Factorland: Three Blunders That Plague Factor Investing’ that “of the thousands of factors tested, some will look good in the backtest purely by luck, i.e. as a consequence of data mining and backtest overfitting (point made by Harvey and Liu (2015) among others). Importantly, many of these lucky factors have little or no economic foundation which is emphasized in Harvey (2017). Some of these factors possibly look good as a result of coding mistakes by researchers or due to problems with the data. For example, McLean and Pontiff (2016) failed to replicate the in-sample performance of 12 out of 97 factors in their examination of the many published anomalies.”
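A little arithmetic shows why luck alone can fill a journal. The sketch below asks a hypothetical question: if a batch of factors were tested and none of them had any real effect, how many would still clear a given t-statistic hurdle? It assumes the tests are independent and that t-statistics are approximately normal. The batch size of 447 is borrowed from the review above purely as a convenient number. It is not a claim that those anomalies are all noise.

```python
from statistics import NormalDist


def expected_false_positives(n_factors, t_cutoff):
    """Expected number of factors with |t| above t_cutoff when none has a real effect.

    Assumes independent tests and approximately normal t-statistics.
    """
    p_one_test = 2 * (1 - NormalDist().cdf(t_cutoff))  # two-sided tail probability
    return n_factors * p_one_test


for cutoff in (1.96, 3.0):
    lucky = expected_false_positives(447, cutoff)
    print(f"|t| > {cutoff}: about {lucky:.1f} of 447 useless factors pass by luck")
```

Raising the hurdle from the conventional 1.96 to 3 cuts the expected number of lucky passes dramatically, which is the logic behind demanding a higher standard of significance when many factors have been tested.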

Data mining isn’t the only problem with backtests


Data mining is clearly a problem to be addressed when drawing conclusions from academic research, but there’s another reason why backtests should be treated with caution.

The crux of the idea is that the market is incredibly different today than it has been in the past. And because the market is so different today, why should a strategy based on data from 200 years ago be relevant to today’s market?

By way of example, the graph below shows how different the sector composition of the market has been in the past to how it is today. If your strategy relies on data going back 200 years, it might sound impressive to investors, but does the fact that it worked when financials were over 90% of the market mean that it must also work today?

Sector changes of the S&P 500

Source: Visual Capitalist

Reinforcing this point, we’ve seen in a previous post (refer to the P/B section) that our economies have transitioned from being dependent on companies whose value was derived from manufacturing capability and physical capital, to being dependent on companies whose value increasingly comes from intangible assets – i.e. services and technology businesses. Today, intangibles like brands, patents, and proprietary data make up much more of the value of a company than they did 100 years ago, when the economy was ruled by manufacturing.

And the same thing can be said for the regional composition of markets. The graphs below show that the UK used to be 25% of the global stock market, but now represents under 6%. The USA used to be 15% of the market, but now represents over 50%:

Geographical changes of the market

Source: Credit Suisse

Not only has the sectoral composition of the market changed, but the regional composition has also changed.

We’ve also seen in another post that returns for factor strategies decay after they become widely known, and that the market is becoming so efficient at pricing securities that most professionals fail to beat the market.

But this hasn’t always been the case. In the golden days before computers commoditised stock market information, many investors made their name exploiting stock market anomalies. These were the days before Bloomberg terminals, before the internet, and before the CFA program. Much less was known about how the market worked, and there was much less competition in the marketplace. The lack of technology made the market more inefficient, and so made it easier for investors to find an edge.

Some investors used strategies which systematically selected stocks that shared certain characteristics. Warren Buffett, for example, made his fortune investing in high-quality companies, with wide “moats”, at reasonable valuations. What has since been discovered though, by researchers at AQR, is that Buffett’s incredible outperformance of the S&P was mostly due to him exploiting traditional factors – low-volatility, quality, and value.

What used to be seen as genius stock selection is now regarded as run-of-the-mill factor exposure – factors provided much better returns when nobody knew they existed.

The market has evolved, and in its evolution, has reduced the relevance of historic data. Sector compositions of the market have changed, the nature of business value has changed, regional compositions have changed, and risk/behavioural-based stock-selection factors are now well-known and heavily exploited.

A strategy that looks good when tested on historic data might not work as well as advertised, for no other reason than that the market on which it was tested no longer exists.

Real world examples


Backtesting has launched many a quantitative investment strategy – some have gone on to perform spectacularly well, and some have not.

Focussing on the latter category, where data-mining and other research problems have caused companies to overestimate the viability of a strategy based on its backtest, let’s have a look at some real-life examples.

Possibly the most high-profile fund launch based on back-tested results in recent months has been the Wealthfront Risk Parity fund. Wealthfront placed such faith in their backtested results that they published them online, and explicitly compared them to two of the other major risk parity funds on the market – one by Bridgewater, and one by AQR.

Wealthfront risk parity fund backtest

Source: WealthFront

Coming out and saying that your strategy will outperform 2 of the biggest names in quantitative investing in itself would probably be enough to cause a stir – usually it’s left up to the reader to compare results against whoever they want. But to call out Cliff Asness of AQR, one of the biggest personalities in finance, who famously (and often hilariously) crushes those who spout nonsense on Twitter – it was a bold move.

Further attention was drawn to the fund when Wealthfront started facing significant criticism in the media after launching the fund for 1) automatically opting clients in to its new fund, and 2) for being less-than-transparent about the fees the fund charged.

Both the confidence in its backtest and the controversy over the fund’s launch meant that investors were eagerly awaiting the fund’s out-of-sample performance. Those who were rooting for the fund to fail as a result of its hubris and underhandedness were not disappointed:

Wealthfront risk parity actual performance

Source: EconomPic

The fund heavily underperformed AQR after launching, and it served as a classic example of the dangers of assuming in-sample returns will continue out-of-sample.

Another instance of real-world performance not living up to a backtest comes from SocGen, who launched the ‘SGI Global Alpha’ index in September 2008 on the back of an extremely attractive backtest, which showed a compound annual growth rate of over 15%.

It all sounded very promising.

The graph below shows the backtested in-sample performance versus the actual out-of-sample performance after it was launched:

SocGen backtest

Source: SocGen

The 15% per year growth rate achieved in backtest has so far resulted in an annualised return since launch of -1%. Not exactly a stellar return, and far, far short of what was implied by the backtest.
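Annualised figures can hide how far apart two outcomes really are, so it helps to compound them. The sketch below uses the two growth rates quoted above but a hypothetical 10-year horizon and a hypothetical starting value of 100, neither of which comes from SocGen. The function names are mine.

```python
def grow(start_value, annual_rate, years):
    """Value of an investment compounding at a constant annual rate."""
    return start_value * (1 + annual_rate) ** years


def annualised_return(start_value, end_value, years):
    """Compound annual growth rate implied by a start and end value."""
    return (end_value / start_value) ** (1 / years) - 1


YEARS = 10  # hypothetical horizon, chosen only for illustration

backtest_end = grow(100, 0.15, YEARS)   # what the backtest implied
live_end = grow(100, -0.01, YEARS)      # what the live record implies

print(f"100 compounding at +15% for {YEARS} years: {backtest_end:.0f}")
print(f"100 compounding at  -1% for {YEARS} years: {live_end:.0f}")
print(f"Round trip check: {annualised_return(100, backtest_end, YEARS):.1%}")
```

Over a decade, the investor who believed the backtest would have expected to roughly quadruple their money, while the live rate leaves them with less than they started with.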

As the saying goes, “I’ve never seen a bad backtest”.


  • A group of monkeys can produce returns that beat the world’s greatest investors – if the right combination of monkeys is selected in hindsight.
  • Just because variables appear to be correlated, does not mean that one caused the other. Turkey weights are highly correlated to the MSCI World, but fatter turkeys do not lead to better stock market returns.
  • Data-mining is a problem in academic finance, with authors’ incentives encouraging the publishing of data-mined investing strategies.
  • Backtested results might not lead to the same results out-of-sample because of 1) data mining, and 2) the fact that market conditions are always changing.
  • Investors need to be wary of backtested results, and pay close attention to differences between in-sample and out-of-sample performances.
  • When forced to rely on only backtested results, it’s important to know the number of trials, and demand a higher standard of statistical significance before investing.
  • Most factors in the ‘factor zoo’ don’t stand up to scrutiny and have been found to be the product of data mining and other backtesting problems. Only those strategies that stand up to the most rigorous testing should be considered for our investment.

Past performance does not guarantee future performance and the value of investments can fall as well as rise. The information on this site is provided for information only and does not constitute, and should not be construed as, investment advice or a recommendation to buy, sell, or otherwise transact in any investment including any products or services or an invitation, offer or solicitation to engage in any investment activity. Please refer to the full disclaimer on the disclaimer page.
