WHAT IS A BACKTEST?
Backtesting involves applying an investment strategy or predictive model to historical data in
order to assess its performance [1]. A successful backtest is a necessary but not sufficient
condition for a strategy to perform as desired.
To backtest an investment strategy: (1) translate the strategy into a fully specified and
actionable algorithm; (2) using a historical database, implement the algorithm; and (3)
compare the returns from the historical database with risk-adjusted benchmarks. As an
example, the strategy could be something as simple as "buy all stocks with dividend yields
above 5%, hold them for a year, and then sell them". Using a historical database, this algorithm
could be implemented by purchasing all stocks with dividend yields above 5% on 01 January
1990, selling them on 31 December 1990, calculating an annual return, purchasing all stocks
with dividend yields above 5% on 01 Jan 1991, selling them on 31 December 1991, calculating
an annual return, etc. As a final step, the resulting set of returns is compared against a historical
benchmark, with risk adjustment accomplished by considering not only the mean return from
this set, but also the standard deviation. Indeed, much of the literature proceeds in essentially
this fashion (e.g., [2]).
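As an illustration only, the following Python sketch mimics these three steps on a synthetic stand-in for a historical database; the column names, the 5% yield threshold, and the return distributions are assumptions made for the example, not data drawn from the literature cited above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for a historical database: one row per stock per year,
# with the dividend yield observed on 1 January and the total return
# realized over that calendar year.  (Column names are illustrative only.)
records = []
for year in range(1990, 2000):
    for stock in range(200):
        records.append({
            "year": year,
            "stock": stock,
            "dividend_yield": rng.uniform(0.0, 0.08),
            "annual_return": rng.normal(0.07, 0.20),
        })
db = pd.DataFrame(records)

# Step 1: fully specified rule -- buy every stock yielding above 5% on
# 1 January, hold for the year, sell on 31 December.
YIELD_THRESHOLD = 0.05

# Step 2: implement the rule year by year and record the portfolio return
# (equal-weighted across the stocks selected that year).
portfolio_returns = (
    db[db["dividend_yield"] > YIELD_THRESHOLD]
    .groupby("year")["annual_return"]
    .mean()
)

# Step 3: compare against a benchmark, with a crude risk adjustment --
# here the benchmark is the equal-weighted return of all stocks, and risk
# is summarized by the standard deviation of the annual returns.
benchmark_returns = db.groupby("year")["annual_return"].mean()

print("strategy : mean %.3f, sd %.3f" %
      (portfolio_returns.mean(), portfolio_returns.std()))
print("benchmark: mean %.3f, sd %.3f" %
      (benchmark_returns.mean(), benchmark_returns.std()))
```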
A backtest is fundamentally different from specifying an investment strategy and then
observing how it performs going forward. It is this latter construct that is ultimately of interest
to investors.
WHAT IS A PREDICTIVE MODEL?
A classical predictive model such as linear regression maps a specific set of inputs, such as
annual dividend yield, price-earnings ratio, earnings growth rate, previous annual return, etc.,
into an output such as a predicted annual return. For example, a behavioral-finance-based
predictive model might assume that investors overvalue companies with very high rates of
earnings growth, in which case the rate of earnings growth becomes a predictor variable, with
the expectation that its regression coefficient will have a negative sign. Once the important predictors have been identified, an investment strategy is developed. For example, if earnings growth turns out to be a strong predictor, the associated investment strategy might be "short
stocks with unusually high rates of earnings growth" or, perhaps, "short stocks with unusually
high rates of earnings growth whose share price momentum has recently turned negative".
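A minimal sketch of such a classical predictive model appears below, assuming synthetic firm-year data in which the behavioral hypothesis (a negative coefficient on earnings growth) is built in by construction; the coefficients and distributions are illustrative assumptions, not estimates from real market data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Synthetic firm-year data (illustrative only): the overvaluation hypothesis
# is encoded by giving earnings growth a negative true coefficient.
dividend_yield = rng.uniform(0.0, 0.08, n)
earnings_growth = rng.normal(0.10, 0.15, n)
prior_return = rng.normal(0.07, 0.20, n)
annual_return = (0.03
                 + 0.8 * dividend_yield
                 - 0.3 * earnings_growth      # behavioral-finance hypothesis
                 + 0.1 * prior_return
                 + rng.normal(0.0, 0.15, n))  # idiosyncratic noise

# Fit the classical predictive model: ordinary least squares mapping the
# inputs to a predicted annual return.
X = np.column_stack([np.ones(n), dividend_yield, earnings_growth, prior_return])
coef, *_ = np.linalg.lstsq(X, annual_return, rcond=None)

print("intercept, dividend_yield, earnings_growth, prior_return:")
print(np.round(coef, 3))  # earnings_growth coefficient should be near -0.3
```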
Predictive models can be causal or non-causal [3-5]. In a causal model the initial set of
candidate predictors are based on a conceptual model, as illustrated above. In a non-causal
model, predictors are still mapped to an output, but (1) a wider set of candidate predictors is
considered (often: all of the variables that are contained within a database); and/or (2) the
predictive model is so complex as to essentially be a "black box" -- in other words, the model
generates a prediction, but not a transparent justification for how that prediction was derived.
Non-causal modelers are willing to use surrogate measures as predictors: for example, if X1 causes Y but isn't directly measured, and X2 happens to be correlated with X1 in a dataset, a non-causal modeler would use X2 as a predictor of Y even though correlation doesn't imply causality, and even though the correlation between X1 and X2 might be transient. Black-box-based non-causal models typically require large databases for their development.
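A small simulation can make the surrogate-predictor idea concrete. In the sketch below the variables, coefficients, and correlation structure are all assumed for illustration: X1 drives Y but is treated as unobserved, while X2 is merely correlated with X1.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000

# X1 is the true causal driver of Y but is assumed to be unobserved;
# X2 is merely correlated with X1 (e.g., they share a common source).
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)   # correlated surrogate
y = 2.0 * x1 + rng.normal(size=n)               # Y is caused by X1 only

# A non-causal modeler regresses Y on the surrogate X2 anyway.
slope, intercept = np.polyfit(x2, y, 1)
r = np.corrcoef(x2, y)[0, 1]
print(f"slope on X2 = {slope:.2f}, correlation(X2, Y) = {r:.2f}")
# X2 "predicts" Y here, but only for as long as its correlation with X1 persists.
```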
Large datasets can support both causal and non-causal predictive models. Nevertheless, "big
data analytics", of which artificial-intelligence (AI) based predictive models are an example,
typically take advantage of big data by either (1) using causal predictors in an especially
complex fashion -- for example, by replacing the assumption that the predictors operate in a simple linear fashion with interaction terms, sophisticated smoothing functions, etc.; or (2)
using non-causal modeling techniques with a wide variety of potential modeling structures. In
either case, large numbers of candidate models are fit, and the models which best fit the data
are retained as the final candidates, perhaps to be further culled using additional criteria.
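The sketch below illustrates this search pattern in miniature, using polynomials of increasing degree as the candidate model structures and in-sample R-squared as the fit criterion; the data and the candidate set are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200

# Illustrative data: one predictor with a mildly nonlinear relation to the outcome.
x = rng.uniform(-2, 2, n)
y = 0.5 * x + 0.2 * x**2 + rng.normal(scale=1.0, size=n)

def r_squared(y_obs, y_hat):
    ss_res = np.sum((y_obs - y_hat) ** 2)
    ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Candidate model structures: polynomials of increasing complexity.
# The pattern is the one described above -- fit many candidates and
# retain the one that best fits the data (here, by in-sample R-squared).
candidates = {}
for degree in range(1, 10):
    coeffs = np.polyfit(x, y, degree)
    candidates[degree] = r_squared(y, np.polyval(coeffs, x))

best = max(candidates, key=candidates.get)
print({d: round(v, 3) for d, v in candidates.items()})
print("retained candidate: degree", best)
# In-sample fit never decreases with complexity, which is exactly why the
# retained candidate is at risk of overfitting.
```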
HOW ARE PREDICTIVE MODELS VALIDATED?
As is the case for any statistical model, AI-based predictive models are at risk of "overfitting".
In other words, by retaining those models which best fit the data, the modeler is fitting not only the "signal" representing the actual causal relationship between the predictors and the outcome, but also the "noise" representing the peculiarities of the data caused by the play of chance [6]. As a general statistical principle, the greater the number of candidate models being considered, the greater their complexity, and the greater the number of analytical steps within the model-fitting process, the greater the likelihood that the model will be overfit and thus perform less well subsequently. In that sense, the strengths of big data analytics are also its greatest potential weaknesses.
Modelers recognize this, and thus place especial emphasis on "model validation" [7]. For clarity of exposition, we will assume that (1) the predictive model in question has a simple and transparent structure such as linear regression; (2) its predictors are either causal or non-causal; and (3) it was selected from a very large number of candidate models. The general principles described here apply to "AI-based predictive models" more generally.
Model validation techniques can be classified according to whether or not an additional dataset
is available: if not, the validation is "internal" and if so, the validation is "external". For external
validation, the model is developed on a "training set" and its performance is assessed using a
"test set". Because performance in the training set can be overstated, the training set is only
used for model development with the test set being used to assess model performance. In
essence, a wall is built between the two datasets.
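A minimal sketch of this wall, assuming synthetic training and test data and an intentionally over-complex polynomial model, might look as follows; the degree-5 polynomial and sample sizes are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)

def fit_and_score(x_train, y_train, x_test, y_test, degree=5):
    """Fit on the training set only; report R-squared on both sets."""
    coeffs = np.polyfit(x_train, y_train, degree)

    def r2(x, y):
        y_hat = np.polyval(coeffs, x)
        return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

    return r2(x_train, y_train), r2(x_test, y_test)

# Training set: the data the model is allowed to see during development.
x_train = rng.uniform(-2, 2, 150)
y_train = 0.5 * x_train + rng.normal(scale=1.0, size=150)

# Test set: kept behind the "wall" and used only to assess performance.
x_test = rng.uniform(-2, 2, 150)
y_test = 0.5 * x_test + rng.normal(scale=1.0, size=150)

r2_train, r2_test = fit_and_score(x_train, y_train, x_test, y_test)
print(f"R-squared on training set: {r2_train:.3f}")
print(f"R-squared on test set:     {r2_test:.3f}   (typically lower)")
```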
Internal validation techniques are modeled after techniques for external validation. For
example, the single database used to derive the predictive model might be randomly split, with 90% of observations in the "training set" and the remaining 10% in the "test set", and a measure of model performance such as an R-squared statistic obtained on the test set. Then, another 90:10 split is implemented, and model performance again assessed on the new 10% "test set". Finally, the measures of
model performance are averaged, and the result is a measure of model performance that
adjusts for overfitting. Numerous variations on this overall theme exist, but the underlying idea
is the same. Of course, if the dataset is sufficiently large a single division into training and test
sets can be made, and techniques of external validation applied directly.
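The repeated-split idea might be sketched as follows, again on synthetic data; the 50 repetitions, the 90:10 split, and the degree-5 polynomial are arbitrary illustrative choices rather than recommendations.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 300

# A single dataset, as in internal validation.
x = rng.uniform(-2, 2, n)
y = 0.5 * x + rng.normal(scale=1.0, size=n)

def test_r_squared(train_idx, test_idx, degree=5):
    """Fit on the 90% split; return R-squared on the held-out 10%."""
    coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
    y_hat = np.polyval(coeffs, x[test_idx])
    resid = y[test_idx] - y_hat
    return 1.0 - np.sum(resid ** 2) / np.sum((y[test_idx] - y[test_idx].mean()) ** 2)

scores = []
for _ in range(50):                      # repeated random 90:10 splits
    perm = rng.permutation(n)
    test_idx, train_idx = perm[: n // 10], perm[n // 10:]
    scores.append(test_r_squared(train_idx, test_idx))

# Averaging the held-out scores gives a performance estimate that
# adjusts for overfitting.
print(f"overfitting-adjusted R-squared: {np.mean(scores):.3f}")
```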
Both internal and external validation techniques can estimate the degree of overfitting.
External validation techniques can also be used to assess generalizability. For that purpose, the more the training and test sets differ, the better. For a causally-based model, successful