Many statisticians see cross-validation as something data miners do rather than as a core statistical technique. Rob Hyndman summarized the role of cross-validation in statistics and gave an example in R, which Udo Sglavo repeated in SAS.
A model’s fit statistics are not a good indicator of its prediction quality: a high fit does not necessarily mean a good model. “It is easy to over-fit the data by including too many degrees of freedom”, says Rob Hyndman.
“For example, in a simple polynomial regression I can just keep adding higher order terms and so get better and better fits to the data. But the predictions from the model on new data will usually get worse as higher order terms are added.”
To measure the predictive ability of a model, you can test it on data that were not used in estimation. These data are the “test set”; the data used for estimation are the “training set”. The predictive accuracy of a model can be measured by its error on the test set. This error will generally be larger than the error on the training set because the test data were not used for estimation.
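The polynomial regression example can be illustrated with a small sketch. It is not Hyndman's R code or Sglavo's SAS code; the simulated quadratic data, the noise level, the seed and the helper names (`fit_polynomial`, `mse`, `simulate`) are all assumptions chosen for illustration. It fits polynomials of increasing degree to a training set and reports the mean squared error on both the training set and a separate test set.

```python
import random

def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit; returns coefficients c[0], ..., c[degree]."""
    n = degree + 1
    # Augmented normal equations [X'X | X'y] for the monomial basis 1, x, x^2, ...
    a = [
        [sum(x ** (i + j) for x in xs) for j in range(n)]
        + [sum(y * x ** i for x, y in zip(xs, ys))]
        for i in range(n)
    ]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n + 1):
                a[row][k] -= factor * a[col][k]
    # Back substitution.
    coef = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(a[i][j] * coef[j] for j in range(i + 1, n))
        coef[i] = (a[i][n] - tail) / a[i][i]
    return coef

def mse(coef, xs, ys):
    """Mean squared prediction error of the fitted polynomial on (xs, ys)."""
    def predict(x):
        return sum(c * x ** k for k, c in enumerate(coef))
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def simulate(n, rng):
    """Hypothetical data: a quadratic signal plus Gaussian noise."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [1.0 + 2.0 * x - 1.5 * x ** 2 + rng.gauss(0.0, 0.3) for x in xs]
    return xs, ys

rng = random.Random(0)
train_x, train_y = simulate(30, rng)   # used for estimation
test_x, test_y = simulate(30, rng)     # held out, never used for estimation

results = []
for degree in range(1, 7):
    coef = fit_polynomial(train_x, train_y, degree)
    results.append((degree, mse(coef, train_x, train_y), mse(coef, test_x, test_y)))

for degree, train_err, test_err in results:
    print(f"degree {degree}: train MSE {train_err:.4f}, test MSE {test_err:.4f}")
```

The training error cannot increase as the degree grows, because each polynomial contains the previous one as a special case. The test error has no such guarantee, which is why it is the more honest measure of predictive ability.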
Rob was recently asked how to implement time series cross-validation in R, the free statistical software project: “Time series people would normally call this forecast evaluation with a rolling origin, or something similar, but it is the natural and obvious analogue to leave-one-out cross-validation for cross-sectional data, so I prefer to call it time series cross-validation”.
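The rolling-origin idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the R or SAS implementation discussed here: the function names, the two toy forecasters and the five-point series are all hypothetical. At each origin the model sees only the observations up to that point, forecasts the next one, and the one-step-ahead errors are averaged.

```python
def rolling_origin_errors(series, min_train, forecast):
    """One-step-ahead absolute errors from a rolling forecast origin."""
    errors = []
    for origin in range(min_train, len(series)):
        history = series[:origin]        # observations available at the origin
        prediction = forecast(history)   # forecast the next observation
        errors.append(abs(series[origin] - prediction))
    return errors

def naive_forecast(history):
    """Forecast the next value as the last observed value."""
    return history[-1]

def mean_forecast(history):
    """Forecast the next value as the mean of all observed values."""
    return sum(history) / len(history)

def mean_absolute_error(errors):
    return sum(errors) / len(errors)

series = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical toy data with a steady trend
naive_errors = rolling_origin_errors(series, 2, naive_forecast)
mean_errors = rolling_origin_errors(series, 2, mean_forecast)

naive_mae = mean_absolute_error(naive_errors)
mean_mae = mean_absolute_error(mean_errors)
print(f"naive MAE: {naive_mae}, mean MAE: {mean_mae}")
```

Unlike ordinary leave-one-out cross-validation, each training set here contains only observations that precede the one being forecast, so the evaluation never uses future data.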
Udo Sglavo of the Advanced Analytics Division of SAS saw Rob’s example and showed how to conduct it using SAS Forecast Server. Udo replicated Hyndman’s example on The Business Forecasting Deal, a blog that aims to expose the seamy underbelly of forecasting practice and to provide practical solutions to its most vexing problems.
“Cross-validation is a computationally intensive approach to assessing forecast model performance, so this needs to be taken into account when trying to apply it on large scale data”, says Sglavo.