K-fold cross validation for time series in R


I recently wrote a small package (kfoldcv4ts) that assesses the fit of Vector Autoregressive (henceforth VAR) models for forecasting by performing a k-fold cross validation adapted to time series. I thought it might be of interest to some people, so I decided to write a short article and share it. As of now, the function kfoldcv4ts::accuracy.kfold has only been tested with varest objects from the vars package. However, since the function is quite simple, it should not be difficult to adapt it to other types of forecasting models.

In this article, I provide a short summary of k-fold cross validation and an application in R using a dataset consisting of several US macroeconomic variables (from Stock and Watson (2007)). For illustration purposes, I evaluate the relative performance of several competing VAR models for predicting the quarterly US unemployment rate using a k-fold cross validation approach adapted to time series.

I provide the commands to install my package kfoldcv4ts, which performs k-fold cross validation for time series (function accuracy.kfold). The package makes use of several existing packages, most notably the excellent forecast and vars packages.


Disclaimer: the function is meant to be easy to read, so that the user can follow along each step of a k-fold cross validation. It is by no means meant to be efficient. Comments are welcome.

To follow the steps of this article, you need to install and load the following packages: dplyr, vars, forecast, and AER. Most of these packages are also required to install the kfoldcv4ts package.

Some theory

Judging the performance of a forecasting model is absolutely critical. A natural way to do so is to look at how well the model's predictions match the observed values. There are a variety of ways to formally quantify the performance of a forecasting model.1 Perhaps the most popular is the MSE (or RMSE), which stands for (root) mean squared error. Mathematically, the MSE is given by: \[MSE = \frac{1}{n} \sum_{i=1}^n(y_i-\hat{f}(x_i))^2\] where \(n\) is the number of observations, \(y_i\) is the observed data, and \(\hat{f}(x_i)\) is the estimated forecasting model expressed as a function of a number of regressors \(x_i\) that help predict \(y_i\).
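To make the formula concrete, here is a minimal base-R illustration with made-up numbers (the vectors y and y_hat are purely illustrative):

```r
# Toy illustration of MSE/RMSE: observed values and a model's predictions
y     = c(2.1, 1.8, 2.5, 3.0)   # observed y_i
y_hat = c(2.0, 2.0, 2.4, 2.7)   # fitted values f_hat(x_i)

mse  = mean((y - y_hat)^2)      # 0.0375
rmse = sqrt(mse)                # RMSE is just the square root of the MSE
```

The RMSE is often preferred in practice because it is expressed in the same units as the variable being forecast.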

Training versus test data

To evaluate the performance of a model, it is important to assess the fit on data that has not been used for estimation (out-of-sample forecasting). The reason for this is that it is very easy to improve in-sample fit by adding regressors, but this may result in overfitting and thus poor performance when the model is confronted with new data.2

It is thus natural to split the data into two parts: a training dataset and a test dataset. The former is used to estimate the model's parameters, and the latter to evaluate how well the model performs when confronted with new data.

K-fold cross validation

K-fold cross validation is precisely a way to split the data into training and test datasets. The easiest way to understand k-fold cross validation is certainly to look at a graphical representation:

Graphical representation of K-fold cross validation. Source: Stackexchange (Jatin Garg)

In words, k-fold cross validation separates the data into k training and test datasets. For each fold, we use the training dataset to estimate the model's parameters, and the test dataset to evaluate its out-of-sample performance. This replicates a real-life situation where the data used for estimation are not the data we want to predict. Additionally, it gives information on how new data affect the performance of the model. An MSE that does not decrease with each additional fold can suggest that the forecasting model is not appropriate for the data studied.
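For time series, the folds cannot be drawn at random, since that would destroy the temporal ordering. A common scheme, consistent with the description above, is an expanding training window followed by a fixed-length test window. The following sketch shows one way to construct such fold indices (the values of n, k and n_ahead are illustrative, and this is an assumed scheme rather than the package's exact implementation):

```r
# Sketch: expanding-window folds for a series of length n.
# Each fold trains on observations 1..e and tests on the n_ahead observations after e.
n = 193; k = 3; n_ahead = 6

fold_ends = floor(seq(n / k, n, length.out = k))   # end of each training window
folds = lapply(fold_ends, function(e) {
  e = min(e, n - n_ahead)                          # leave room for the test window
  list(train = 1:e, test = (e + 1):(e + n_ahead))
})
sapply(folds, function(f) max(f$train))            # training windows grow with each fold
```

Note that the last fold trains on (almost) the entire sample, which is why performance on later folds tells us how the model benefits from additional data.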

Application to the US unemployment rate in R

We use macroeconomic time series from the AER package. The dataset (USMacroSW) is from Stock and Watson (2007). It consists of 7 time series from 1957 to 2005, available at the quarterly frequency. The following command loads the data in R:

data("USMacroSW", package = 'AER')

As an exercise, we decide to estimate a VAR consisting of the first 6 variables:

data_VAR = USMacroSW[, 1:6]
colnames(data_VAR)
## [1] "unemp"  "cpi"    "ffrate" "tbill"  "tbond"  "gbpusd"

We store the optimal number of lags according to the AIC(n) criterion (equal to 6):

opt.lags.AIC = vars::VARselect(data_VAR)$selection[['AIC(n)']]
opt.lags.HQ = vars::VARselect(data_VAR)$selection[['HQ(n)']] # for later use

We now use our function to perform a k-fold cross validation. The source code of the function can be found on my GitHub or by running kfoldcv4ts::accuracy.kfold in the console once the package has been installed. In a few words, the function splits the data into k training and test folds. It then estimates a VAR on each training fold and returns various out-of-sample measures for each fold. The last element of the list corresponds to the k-th fold and uses the whole dataset (minus the last n_ahead observations, which are used as a test dataset) to train the model.
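The steps just described can be sketched as follows. This is a hypothetical, simplified re-implementation for illustration only, not the package's actual source code; it assumes the expanding-window fold scheme and relies on vars::VAR, predict, and forecast::accuracy:

```r
# Hypothetical sketch of a k-fold CV loop for a VAR (illustration only)
kfold_sketch = function(df.VAR, k, n_ahead, lags, var_index) {
  n = nrow(df.VAR)
  fold_ends = floor(seq(n / k, n, length.out = k))
  lapply(fold_ends, function(e) {
    e        = min(e, n - n_ahead)                 # leave room for the test window
    training = df.VAR[1:e, ]                       # expanding training window
    test     = df.VAR[(e + 1):(e + n_ahead), var_index]
    fit      = vars::VAR(training, p = lags)       # estimate the VAR on the fold
    fc       = predict(fit, n.ahead = n_ahead)     # out-of-sample forecast
    # compare the point forecasts of the variable of interest to the test data
    forecast::accuracy(fc$fcst[[var_index]][, "fcst"], test)
  })
}
```

The returned list mirrors the package's output: one accuracy table per fold, with the last fold trained on (nearly) the full sample.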

kfoldcv4ts::accuracy.kfold(df.VAR = data_VAR, k=3, n_ahead = 6, lags = opt.lags.AIC, var_index = 1)
## [[1]]
##                         ME       RMSE        MAE         MPE      MAPE
## Training set -4.269906e-17 0.07619774 0.05697584 -0.02457589  1.187002
## Test set      1.889559e+00 1.95627730 1.88955891 33.30258643 33.302586
##                    MASE
## Training set 0.06163578
## Test set     2.04410207
## [[2]]
##                         ME      RMSE       MAE         MPE      MAPE      MASE
## Training set  1.147934e-17 0.2072081 0.1559196  -0.1995576  2.702903 0.1191487
## Test set     -6.340175e-01 0.6735066 0.6340175 -11.2712537 11.271254 0.4844957
## [[3]]
##                         ME      RMSE       MAE        MPE     MAPE      MASE
## Training set  1.974212e-17 0.2028920 0.1585884 -0.1710733 2.790761 0.1393353
## Test set     -7.225542e-02 0.1404539 0.1159403 -1.2430540 2.047070 0.1018648

We now compare the following three models in terms of test-set RMSE (using k=3). model1 is the baseline, with the optimal number of lags chosen according to the AIC criterion, while model2 uses the optimal number of lags according to the HQ criterion. model3 does not use the 6th variable of the dataset (gbpusd) to estimate the VAR.

model1 = kfoldcv4ts::accuracy.kfold(df.VAR = data_VAR, k=3, n_ahead = 6, lags = opt.lags.AIC, var_index = 1)
model2 = kfoldcv4ts::accuracy.kfold(df.VAR = data_VAR, k=3, n_ahead = 6, lags = opt.lags.HQ, var_index = 1)
model3 = kfoldcv4ts::accuracy.kfold(df.VAR = data_VAR[,1:5], k=3, n_ahead = 6, lags = 6, var_index = 1)

We summarise the results in the following table:

f.extract = function(x){x[2, 2]} # test-set RMSE of each fold
m1 = sapply(model1, f.extract)
m2 = sapply(model2, f.extract)
m3 = sapply(model3, f.extract)

table_summary = data.frame(rbind(m1, m2, m3), row.names = c('Model 1', 'Model 2', 'Model 3'))
colnames(table_summary) = c('k=1', 'k=2', 'k=3')
kableExtra::kable(table_summary, caption = 'RMSE for each model and each fold')
Table 1: RMSE for each model and each fold
k=1 k=2 k=3
Model 1 1.956277 0.6735066 0.1404539
Model 2 3.321254 0.3762237 0.1756277
Model 3 1.067212 0.4424066 0.1498588

We can note several interesting things. First, Model 1 seems to perform best in terms of RMSE when estimated on the largest possible training dataset (i.e. \(k=3\)). Additionally, more data significantly improves the fit of the model (the RMSE falls from 1.96 to 0.14). We can also see that Model 2 performs particularly badly when the training dataset is relatively small (k=1). In terms of average RMSE over the 3 folds, Model 3 performs best and may be considered a good alternative to Model 1, even though it has one variable less. This provides an example of the risk of overfitting.
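The fold-average comparison can be reproduced directly from the values in Table 1:

```r
# Average test-set RMSE per model, using the values reported in Table 1
rmse = rbind('Model 1' = c(1.956277, 0.6735066, 0.1404539),
             'Model 2' = c(3.321254, 0.3762237, 0.1756277),
             'Model 3' = c(1.067212, 0.4424066, 0.1498588))
avg = rowMeans(rmse)
avg  # Model 3 has the lowest average RMSE across the three folds
```

Averaging across folds rewards models that forecast well at every sample size, not just on the full dataset.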


Stock, James H, and Mark W Watson. 2007. Introduction to Econometrics.

  1. There exist other measures to assess the fit of a model, such as the MAE (mean absolute error), the MPE (mean percentage error), and certainly many others. The best measure is usually not absolute and often depends on the application at hand.↩︎

  2. For more information about this, read for example: https://en.wikipedia.org/wiki/Bias–variance_tradeoff↩︎