WHAT IS A BACKTEST?
Backtesting involves applying an investment strategy or predictive model to historical data in
order to assess its performance [1]. A successful backtest is a necessary but not sufficient
condition for a strategy to perform as desired.
To backtest an investment strategy: (1) translate the strategy into a fully specified and
actionable algorithm; (2) using a historical database, implement the algorithm; and (3)
compare the returns from the historical database with risk-adjusted benchmarks. As an
example, the strategy could be something as simple as "buy all stocks with dividend yields
above 5%, hold them for a year, and then sell them". Using a historical database, this algorithm
could be implemented by purchasing all stocks with dividend yields above 5% on 01 January
1990, selling them on 31 December 1990, calculating an annual return, purchasing all stocks
with dividend yields above 5% on 01 Jan 1991, selling them on 31 December 1991, calculating
an annual return, etc. As a final step, the resulting set of returns is compared against a historical
benchmark, with risk adjustment accomplished by considering not only the mean return from
this set, but also the standard deviation. Indeed, much of the literature proceeds in essentially
this fashion (e.g., [2]).
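As an illustration only, the following Python sketch mimics these three steps on a synthetic stand-in for a historical database; the column names, the 5% yield threshold, and the return distributions are assumptions made for the example, not data drawn from the literature cited above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for a historical database: one row per stock per year,
# with the dividend yield observed on 1 January and the total return
# realized over that calendar year.  (Column names are illustrative only.)
records = []
for year in range(1990, 2000):
    for stock in range(200):
        records.append({
            "year": year,
            "stock": stock,
            "dividend_yield": rng.uniform(0.0, 0.08),
            "annual_return": rng.normal(0.07, 0.20),
        })
db = pd.DataFrame(records)

# Step 1: fully specified rule -- buy every stock yielding above 5% on
# 1 January, hold for the year, sell on 31 December.
YIELD_THRESHOLD = 0.05

# Step 2: implement the rule year by year and record the portfolio return
# (equal-weighted across the stocks selected that year).
portfolio_returns = (
    db[db["dividend_yield"] > YIELD_THRESHOLD]
    .groupby("year")["annual_return"]
    .mean()
)

# Step 3: compare against a benchmark, with a crude risk adjustment --
# here the benchmark is the equal-weighted return of all stocks, and risk
# is summarized by the standard deviation of the annual returns.
benchmark_returns = db.groupby("year")["annual_return"].mean()

print("strategy : mean %.3f, sd %.3f" %
      (portfolio_returns.mean(), portfolio_returns.std()))
print("benchmark: mean %.3f, sd %.3f" %
      (benchmark_returns.mean(), benchmark_returns.std()))
```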
A backtest is fundamentally different from specifying an investment strategy and then
observing how it performs going forward. It is this latter construct that is ultimately of interest
to investors.
WHAT IS A PREDICTIVE MODEL?
A classical predictive model such as linear regression maps a specific set of inputs, such as
annual dividend yield, price-earnings ratio, earnings growth rate, previous annual return, etc.,
into an output such as a predicted annual return. For example, a behavioral-finance-based
predictive model might assume that investors overvalue companies with very high rates of
earnings growth, in which case the rate of earnings growth becomes a predictor variable, with
the expectation that its regression coefficient will have a negative sign. Once the important predictors have been identified, an investment strategy is developed. For example, if earnings growth turns out to be a strong predictor, the associated investment strategy might be "short
stocks with unusually high rates of earnings growth" or, perhaps, "short stocks with unusually
high rates of earnings growth whose share price momentum has recently turned negative".
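A minimal sketch of such a classical predictive model appears below, assuming synthetic firm-year data in which the behavioral hypothesis (a negative coefficient on earnings growth) is built in by construction; the coefficients and distributions are illustrative assumptions, not estimates from real market data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Synthetic firm-year data (illustrative only): the overvaluation hypothesis
# is encoded by giving earnings growth a negative true coefficient.
dividend_yield = rng.uniform(0.0, 0.08, n)
earnings_growth = rng.normal(0.10, 0.15, n)
prior_return = rng.normal(0.07, 0.20, n)
annual_return = (0.03
                 + 0.8 * dividend_yield
                 - 0.3 * earnings_growth      # behavioral-finance hypothesis
                 + 0.1 * prior_return
                 + rng.normal(0.0, 0.15, n))  # idiosyncratic noise

# Fit the classical predictive model: ordinary least squares mapping the
# inputs to a predicted annual return.
X = np.column_stack([np.ones(n), dividend_yield, earnings_growth, prior_return])
coef, *_ = np.linalg.lstsq(X, annual_return, rcond=None)

print("intercept, dividend_yield, earnings_growth, prior_return:")
print(np.round(coef, 3))  # earnings_growth coefficient should be near -0.3
```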
Predictive models can be causal or non-causal [3-5]. In a causal model the initial set of
candidate predictors are based on a conceptual model, as illustrated above. In a non-causal
model, predictors are still mapped to an output, but (1) a wider set of candidate predictors is
considered (often: all of the variables that are contained within a database); and/or (2) the
predictive model is so complex as to essentially be a "black box" -- in other words, the model
generates a prediction, but not a transparent justification for how that prediction was derived.
Non-causal modelers are willing to use surrogate measures as predictors: for example, if X1 causes Y but isn't directly measured, and X2 happens to be correlated with X1 in a dataset, a non-causal modeler would use X2 as a predictor of Y even though correlation doesn't imply causality, and even though the correlation between X1 and X2 might be transient. Black-box-based non-causal models typically require large databases for their development.
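A small simulation can make the surrogate-predictor idea concrete. In the sketch below the variables, coefficients, and correlation structure are all assumed for illustration: X1 drives Y but is treated as unobserved, while X2 is merely correlated with X1.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000

# X1 is the true causal driver of Y but is assumed to be unobserved;
# X2 is merely correlated with X1 (e.g., they share a common source).
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)   # correlated surrogate
y = 2.0 * x1 + rng.normal(size=n)               # Y is caused by X1 only

# A non-causal modeler regresses Y on the surrogate X2 anyway.
slope, intercept = np.polyfit(x2, y, 1)
r = np.corrcoef(x2, y)[0, 1]
print(f"slope on X2 = {slope:.2f}, correlation(X2, Y) = {r:.2f}")
# X2 "predicts" Y here, but only for as long as its correlation with X1 persists.
```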
Large datasets can support both causal and non-causal predictive models. Nevertheless, "big
data analytics", of which artificial-intelligence (AI) based predictive models are an example,
typically take advantage of big data by either (1) using causal predictors in an especially
complex fashion -- for example, by replacing the assumption that the predictors operate in a simple linear fashion with interaction terms, sophisticated smoothing functions, etc.; or (2)
using non-causal modeling techniques with a wide variety of potential modeling structures. In
either case, large numbers of candidate models are fit, and the models which best fit the data
are retained as the final candidates, perhaps to be further culled using additional criteria.
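The sketch below illustrates this search pattern in miniature, using polynomials of increasing degree as the candidate model structures and in-sample R-squared as the fit criterion; the data and the candidate set are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200

# Illustrative data: one predictor with a mildly nonlinear relation to the outcome.
x = rng.uniform(-2, 2, n)
y = 0.5 * x + 0.2 * x**2 + rng.normal(scale=1.0, size=n)

def r_squared(y_obs, y_hat):
    ss_res = np.sum((y_obs - y_hat) ** 2)
    ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Candidate model structures: polynomials of increasing complexity.
# The pattern is the one described above -- fit many candidates and
# retain the one that best fits the data (here, by in-sample R-squared).
candidates = {}
for degree in range(1, 10):
    coeffs = np.polyfit(x, y, degree)
    candidates[degree] = r_squared(y, np.polyval(coeffs, x))

best = max(candidates, key=candidates.get)
print({d: round(v, 3) for d, v in candidates.items()})
print("retained candidate: degree", best)
# In-sample fit never decreases with complexity, which is exactly why the
# retained candidate is at risk of overfitting.
```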
HOW ARE PREDICTIVE MODELS VALIDATED?
As is the case for any statistical model, AI-based predictive models are at risk of "overfitting".
In other words, by retaining those models which best fit the data, the modeler is fitting not only the "signal" representing the actual causal relationship between the predictors and the outcome, but also the "noise" representing the peculiarities of the data caused by the play of chance [6]. As a general statistical principle, the greater the number of candidate models being considered, the greater their complexity, and the greater the number of analytical steps within the model-fitting process, the greater the likelihood that the model will be overfit and thus perform less well subsequently. In that sense, the strengths of big data analytics are also its greatest potential weaknesses.
Modelers recognize this, and thus place especial emphasis on "model validation" [7]. For clarity of exposition, we will assume that (1) the predictive model in question has a simple and transparent structure such as linear regression; (2) its predictors are either causal or non-causal; and (3) it was selected from a very large number of candidate models. The general principles described here apply to "AI-based predictive models" more generally.
Model validation techniques can be classified according to whether or not an additional dataset
is available: if not, the validation is "internal" and if so, the validation is "external". For external
validation, the model is developed on a "training set" and its performance is assessed using a
"test set". Because performance in the training set can be overstated, the training set is only
used for model development with the test set being used to assess model performance. In
essence, a wall is built between the two datasets.
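A minimal sketch of this wall, assuming synthetic training and test data and an intentionally over-complex polynomial model, might look as follows; the degree-5 polynomial and sample sizes are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)

def fit_and_score(x_train, y_train, x_test, y_test, degree=5):
    """Fit on the training set only; report R-squared on both sets."""
    coeffs = np.polyfit(x_train, y_train, degree)

    def r2(x, y):
        y_hat = np.polyval(coeffs, x)
        return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

    return r2(x_train, y_train), r2(x_test, y_test)

# Training set: the data the model is allowed to see during development.
x_train = rng.uniform(-2, 2, 150)
y_train = 0.5 * x_train + rng.normal(scale=1.0, size=150)

# Test set: kept behind the "wall" and used only to assess performance.
x_test = rng.uniform(-2, 2, 150)
y_test = 0.5 * x_test + rng.normal(scale=1.0, size=150)

r2_train, r2_test = fit_and_score(x_train, y_train, x_test, y_test)
print(f"R-squared on training set: {r2_train:.3f}")
print(f"R-squared on test set:     {r2_test:.3f}   (typically lower)")
```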
Internal validation techniques are modeled after techniques for external validation. For
example, the single database used to derive the predictive model might be randomly split, with 90% of observations in the "training set" and the remaining 10% in the "test set", and a measure of model performance such as an R-squared statistic obtained on the test set. Then, another 90:10 split is implemented, and model performance again assessed on the new 10% "test set". Finally, the measures of
model performance are averaged, and the result is a measure of model performance that
adjusts for overfitting. Numerous variations on this overall theme exist, but the underlying idea
is the same. Of course, if the dataset is sufficiently large a single division into training and test
sets can be made, and techniques of external validation applied directly.
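The repeated-split idea might be sketched as follows, again on synthetic data; the 50 repetitions, the 90:10 split, and the degree-5 polynomial are arbitrary illustrative choices rather than recommendations.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 300

# A single dataset, as in internal validation.
x = rng.uniform(-2, 2, n)
y = 0.5 * x + rng.normal(scale=1.0, size=n)

def test_r_squared(train_idx, test_idx, degree=5):
    """Fit on the 90% split; return R-squared on the held-out 10%."""
    coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
    y_hat = np.polyval(coeffs, x[test_idx])
    resid = y[test_idx] - y_hat
    return 1.0 - np.sum(resid ** 2) / np.sum((y[test_idx] - y[test_idx].mean()) ** 2)

scores = []
for _ in range(50):                      # repeated random 90:10 splits
    perm = rng.permutation(n)
    test_idx, train_idx = perm[: n // 10], perm[n // 10:]
    scores.append(test_r_squared(train_idx, test_idx))

# Averaging the held-out scores gives a performance estimate that
# adjusts for overfitting.
print(f"overfitting-adjusted R-squared: {np.mean(scores):.3f}")
```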
Both internal and external validation techniques can estimate the degree of overfitting.
External validation techniques can also be used to assess generalizability. For that purpose, the more the training and test sets differ, the better. For a causally-based model, successful