Holdout Sets: Good or Bad?

My friend Tom Reilly of Automatic Forecasting Systems posted this comment on the INFORMS discussion group on LinkedIn:

Some use all of the data and some withhold data to find the best forecasting model? Withholding is arbitrary as changing the withhold from x to y means a completely different model and forecast. How many observations to withhold? Who can say? There is nothing objective about this approach. We say use all of the data. We call this approach “tail wagging the dog”. Look under the hood of your forecasting engine and see what it's doing. You'd be surprised what you find.

Tom brings up an interesting issue which merits discussion. I agree with the importance of knowing what’s under the hood of your forecasting engine — there may be some unsavory character like this cranking out your forecasts:

However, I’m going to argue in favor of using holdout sets as part of the model selection process. But first some background…

Withholding data means dividing your history into what is commonly called an “initialization” set, and a “holdout” set. For example, if you have 48 months of demand history, you could use the oldest 36 months as your initialization set, and develop your forecasting model based on that data alone. You then test the performance of the model over the most recent 12 months that were withheld from the model development process – that is, you see how well the model forecast those 12 months. If the model performed satisfactorily, you might gain some confidence that you’ve correctly captured any systematic pattern in the data, and then can refit the model parameters based on the full 48 months of data and begin forecasting the future. (Once you have some confirmation of the appropriateness of the model, then I would agree that it is correct to refit model parameters based on the full historical data.)
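The 48-month example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the demand numbers are made up, a least-squares linear trend stands in for whatever model you would actually develop, and the function names (`split_holdout`, `fit_linear_trend`, and so on) are hypothetical, not taken from any forecasting package.

```python
def split_holdout(history, holdout_len):
    """Split a series into (initialization, holdout); the holdout is the most recent data."""
    if not 0 < holdout_len < len(history):
        raise ValueError("holdout_len must be between 1 and len(history) - 1")
    return history[:-holdout_len], history[-holdout_len:]


def fit_linear_trend(series):
    """Least-squares fit of y = a + b*t for t = 0, 1, ..., n-1. Returns (a, b)."""
    n = len(series)
    t_mean = (n - 1) / 2
    y_mean = sum(series) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
    slope = sxy / sxx
    return y_mean - slope * t_mean, slope


def forecast_linear_trend(params, start, horizon):
    """Extend the fitted line over periods start, start+1, ..., start+horizon-1."""
    a, b = params
    return [a + b * t for t in range(start, start + horizon)]


def mean_absolute_error(actual, predicted):
    return sum(abs(x - y) for x, y in zip(actual, predicted)) / len(actual)


# Made-up history: 48 months of demand with a trend and an alternating wobble.
history = [100 + 2 * t + (5 if t % 2 == 0 else -5) for t in range(48)]

# Step 1: oldest 36 months for model development, most recent 12 withheld.
init, holdout = split_holdout(history, holdout_len=12)

# Step 2: build the model on the initialization set only.
params = fit_linear_trend(init)

# Step 3: see how well that model forecast the 12 withheld months.
holdout_forecast = forecast_linear_trend(params, start=len(init), horizon=len(holdout))
holdout_mae = mean_absolute_error(holdout, holdout_forecast)
print(f"Holdout MAE: {holdout_mae:.2f}")

# Step 4: if the holdout performance is acceptable, refit the parameters on
# all 48 months and forecast the future.
final_params = fit_linear_trend(history)
next_12 = forecast_linear_trend(final_params, start=len(history), horizon=12)
```

What counts as "acceptable" holdout performance is a judgment call that the sketch deliberately leaves to the forecaster.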

On the other hand, if the model did poorly in forecasting the 12-month holdout set, this is a good indication that there is no ongoing systematic pattern in the data, or else that your model completely misinterpreted the pattern. You may again try to seek out the pattern in the data and build a new model, or else concede that there is no stable pattern. (When there is no systematic pattern in the data, the most appropriate model may be something very simple, like a moving average, seasonal random walk, or single exponential smoothing. Such a model probably won’t forecast particularly well, but will likely outperform a more elaborate model that is propagating false patterns that aren’t really there.)
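For readers who have not seen them written out, the three simple models named above are small enough to sketch directly. These are textbook forms, not any particular package's implementation; the function names, the demand numbers, and the choice to start the smoothed level at the first observation are assumptions made for the example.

```python
def moving_average_forecast(series, window, horizon):
    """Flat forecast equal to the mean of the last `window` observations."""
    recent = series[-window:]
    level = sum(recent) / len(recent)
    return [level] * horizon


def seasonal_random_walk_forecast(series, season_len, horizon):
    """Repeat the most recent full season (for monthly data: same month last year).

    Assumes the series contains at least one full season.
    """
    last_season = series[-season_len:]
    return [last_season[h % season_len] for h in range(horizon)]


def single_exponential_smoothing_forecast(series, alpha, horizon):
    """Flat forecast at the final smoothed level, with 0 < alpha <= 1.

    The level is initialized to the first observation, which is one common
    choice among several.
    """
    level = series[0]
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
    return [level] * horizon


# Made-up demand history with no obvious pattern.
demand = [12, 15, 11, 14, 13, 16, 12, 15]

print(moving_average_forecast(demand, window=4, horizon=3))
print(seasonal_random_walk_forecast(demand, season_len=4, horizon=3))
print(single_exponential_smoothing_forecast(demand, alpha=0.3, horizon=3))
```

None of these models has more than one tunable setting, which is the point: there is very little room for them to chase noise.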

As a practical matter, there is often not enough history available to utilize a holdout. If you only have 6 periods of history, you probably need to use all of it for model development. However, when you do have sufficient history to create a holdout set, there are at least two good reasons to do so:

1) Model performance over the holdout set may provide some indication of how accurate future forecasts will be (or at least be a better indicator of future forecast accuracy than the fit of the model to history).

When we construct a forecasting model using all of the available historical data, the fit of the model to history provides little indication of how well the model is going to forecast the future. Utilizing all of the history also makes it easy to “overfit” the model without realizing it, fitting to randomness rather than any systematic pattern in the data. It is always possible to construct a model with perfect fit to history, but can we then expect this model to forecast the future with 100% accuracy? Of course not. A model’s fit to history will nearly always be better (and often very much better) than the accuracy of its forecasts of the future. However, by withholding the most recent history (dividing history into an “initialization” set and a “holdout” set) and building the model based only on the “initialization” set, we can see how well the model truly forecasts over the “holdout” set. While model performance over the holdout sample is still a far-from-perfect indicator of how well the model will forecast the future, at least this should be a better indicator than the fit of a model built solely from the full history.
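A toy example makes the gap between fit and forecast accuracy concrete. In the sketch below, a polynomial forced through every point of the initialization set stands in for the "perfect fit" model, and a plain average stands in for a simple one. The data are made up, and the helper names are hypothetical; the polynomial is chosen purely to show overfitting, not as a model anyone would recommend.

```python
def interpolating_polynomial(ys):
    """Return p(x), the Lagrange polynomial through (0, ys[0]), (1, ys[1]), ..."""
    xs = range(len(ys))

    def p(x):
        total = 0.0
        for i, yi in zip(xs, ys):
            basis = 1.0
            for m in xs:
                if m != i:
                    basis *= (x - m) / (i - m)
            total += yi * basis
        return total

    return p


def mean_absolute_error(actual, predicted):
    return sum(abs(a - f) for a, f in zip(actual, predicted)) / len(actual)


# Made-up history: 12 periods of noise around a stable level, no real pattern.
history = [20, 23, 19, 22, 18, 24, 21, 20, 22, 19, 23, 21]
init, holdout = history[:8], history[8:]
n = len(init)

# "Perfect fit" model: passes through every initialization point.
poly = interpolating_polynomial(init)
poly_fit_mae = mean_absolute_error(init, [poly(t) for t in range(n)])
poly_holdout_mae = mean_absolute_error(holdout, [poly(t) for t in range(n, len(history))])

# Simple model: forecast the average of the initialization set.
mean_level = sum(init) / n
mean_fit_mae = mean_absolute_error(init, [mean_level] * n)
mean_holdout_mae = mean_absolute_error(holdout, [mean_level] * len(holdout))

print(f"polynomial: fit MAE {poly_fit_mae:.3f}, holdout MAE {poly_holdout_mae:.1f}")
print(f"average:    fit MAE {mean_fit_mae:.3f}, holdout MAE {mean_holdout_mae:.3f}")
```

Judged on fit to history alone, the polynomial looks unbeatable. Only the holdout reveals that it has learned the noise and that the plain average forecasts far better.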

2) Getting completely different models by using different initialization and holdout sets is very valuable information!

It is true that you can get completely different models based on your choice of the holdout period. For example, with 48 months of history we could withhold the most recent 12 months and come up with model A, withhold the most recent 11 months and come up with a different model B, withhold the most recent 10 months and come up with a different model C, and so on. But what is this telling us? The data is telling us that there is no ongoing systematic structure. If the model completely changes every time we change the holdout set, this tells us that none of the models is correct – that the behavior we are trying to forecast is not stable over time and we should have little expectation of modeling it correctly. (On the other hand, if we end up with essentially the same model no matter what initialization set is used, and the model performs well over the holdout, this could give us confidence that there is systematic structure in the data and that our model is capturing it correctly.) If we do not use holdout sets, but instead only build a model based on the full history, we can be fooled into thinking our model is correct. This leads to unwarranted confidence in the accuracy of future forecasts – which can result in bad business decisions.
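One rough way to run this stability check is to repeat the model selection for several holdout lengths and see whether the same model keeps winning. The sketch below does that with three deliberately simple candidates and made-up seasonal data. Everything in it is an assumption for illustration: the candidate list, the use of mean absolute error to pick a winner, the holdout lengths of 6 to 12 months, and all the names. Comparing which candidate wins is only a proxy for "the same model"; in practice you would also compare fitted parameters.

```python
def mean_forecast(train, horizon):
    return [sum(train) / len(train)] * horizon


def random_walk_forecast(train, horizon):
    return [train[-1]] * horizon


def seasonal_random_walk_forecast(train, horizon, season_len=12):
    last_season = train[-season_len:]
    return [last_season[h % season_len] for h in range(horizon)]


# Hypothetical candidate set; a real study would use its own models.
CANDIDATES = {
    "mean": mean_forecast,
    "random walk": random_walk_forecast,
    "seasonal random walk": seasonal_random_walk_forecast,
}


def mean_absolute_error(actual, predicted):
    return sum(abs(a - f) for a, f in zip(actual, predicted)) / len(actual)


def best_model(history, holdout_len):
    """Name of the candidate with the lowest MAE over the withheld periods."""
    train, holdout = history[:-holdout_len], history[-holdout_len:]
    scores = {
        name: mean_absolute_error(holdout, forecast(train, holdout_len))
        for name, forecast in CANDIDATES.items()
    }
    return min(scores, key=scores.get)


# Made-up data: 48 months with a repeating yearly shape plus a small wobble.
pattern = [80, 85, 95, 110, 130, 150, 160, 155, 135, 115, 95, 85]
history = [pattern[t % 12] + (t % 5) for t in range(48)]

# Withhold the most recent 6, 7, ..., 12 months in turn.
winners = {h: best_model(history, h) for h in range(6, 13)}
for h, name in winners.items():
    print(f"withhold {h:2d} months -> {name}")

stable = len(set(winners.values())) == 1
print("same model every time" if stable else "model changes with the holdout")
```

With this strongly seasonal series the same candidate wins at every holdout length. On data with no stable structure, the winner would be expected to jump around, which is exactly the warning sign described above.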

It is unfortunate that we often lack sufficient history to utilize a holdout set as part of the model development process. But when we do have enough history, there is a lot of useful information to be gained from this approach.

(Illustration by Jessica Crews: jlcrews25@yahoo.com)