
Chapter — V:

Cross-Validation.

The Story so far

In the first part of this study, I analyzed why recessions matter and how they affect returns in the equity market (S&P 500). The second chapter focused on the creation of the dataset and its EDA. In the third part, I tackled the issue of dimensionality reduction by shortlisting two strategies to address it (SelectFromModel with LogisticRegression and PCA), to be used later in the construction of an ML pipeline. In Chapter IV, I explored data scaling strategies and, at the end of the analysis, selected a set of scalers to apply in the analysis (QuantileTransformer and MinMaxScaler). Now, in this chapter, I will set out the best cross-validation strategy to use in this study, outlining the issues and challenges hidden in this step. The cross-validation strategy is fundamental, as it is a pivotal component in fine-tuning models and in understanding how well they generalize to unseen data.

Index:

5.1 Cross-Validation — The Problem;
5.2 Cross-Validation — Solutions;
5.3 Cross-Validation — Strategies Testing;
5.4 Cross-Validation and Hyperparameters;
5.5 Conclusion.

5.1 Cross-Validation — The Problem

At the end of the previous chapter, I analyzed some models’ generalization performance over a validation dataset. Some of my few readers might have jumped out of their chairs looking at the very high validation scores (reproduced in Tab.1) and started setting up a macro hedge fund relying on these models to predict recessions. Not so fast. What if I have just been lucky (or unlucky; maybe, in the end, results will be even higher)?

Tab.1: Stellar Results… let’s trade! … no, wait…

When I split the dataset into training and validation sets, even though the split was random, the outcome might have been a validation set that is very easy to classify. Or, even worse, since I am dealing with an imbalanced dataset (only 14% of the observations are recessions), what if the validation set has no or very few recession observations? How can I evaluate a model’s ability to spot downturns if my validation set has no recessions at all?

Moreover, there are some considerations related to the nature of the data in the dataset: time series. With time-series data, particular care must be taken in splitting the data to prevent data leakage. For instance, if data from the 60s are withheld for validation purposes, and I use data from the 50s and the 70s to fit the model, I am intrinsically leaking information ‘into the past’. Data from the 70s might carry information about feature interactions that was not available in the 60s. A proper cross-validation framework for time series should reproduce real-world forecasting conditions, in which a data scientist stands in the present and the future is unknown. The forecaster must therefore withhold from model fitting all data about events that occur chronologically after the validation period.

To sum up, the cross-validation strategy for this study has to address two main issues:

  1. Imbalanced dataset: it is possible that a validation set has too few or no observations belonging to the minority class; and
  2. Information leaking: information from future observations must not be passed over to a past validation set.

Goal: In this chapter, I aim to spot the best Cross-Validation strategy to evaluate Machine Learning algorithms.

5.2 Cross-Validation — Solutions

In order to address the two issues mentioned above, the first cross-validation strategy employed is the TimeSeriesSplit provided by SciKit-Learn. This is a cross-validator designed specifically for time series. Supposing that N-fold cross-validation is performed, the cross-validation iterations will follow the pattern below (a code sketch follows the list):

  • fold 1: training [1], test [2]
  • fold 2: training [1, 2], test [3]
  • fold 3: training [1, 2, 3], test [4]
  • fold 4: training [1, 2, 3, 4], test [5]
  • …
  • fold N: training [1, 2, 3, 4, 5, … N-1], test [N]
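
As a minimal sketch of this pattern, SciKit-Learn’s TimeSeriesSplit can be applied to a hypothetical stand-in for the chronologically ordered dataset:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical stand-in for the chronologically ordered observations
X = np.arange(24).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X), start=1):
    # Every training window ends right before its test window begins
    print(f"fold {fold}: train [0..{train_idx[-1]}], test [{test_idx[0]}..{test_idx[-1]}]")
```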

Applied to the dataset I am using, this process can be represented as in Pic.1 on the left. The first five bars (0–4) represent the CV folds, while the bar at the bottom is the target variable (recession periods are marked in brown). This approach fully addresses the ‘information leaking’ issue: in each CV iteration, all of the training data precede the validation set. Unfortunately, the first issue is only partially addressed. Folds 0, 3 and 4 have very few recession periods in their validation sets. The third validation set, for instance, captures only the early ’90s recession, which was particularly mild and short. Even reducing the number of folds from 5 to 3 does not help much: the last fold’s validation sample would cover only the 2001 recession.

Pic.1 — Comparison of Cross-Validation strategies.

An alternative approach that may be considered is StratifiedKFold cross-validation. It resembles the basic CV method, but it ensures that each validation set has the same frequency of the minority class (recessions) as the whole sample. A graphic representation of this approach is presented in Pic.1 on the right. In this way, any validation set will have about 14% of its observations being recession periods. Sad to say, the information-leaking issue here is completely ignored.
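
For comparison, a minimal sketch of the stratified approach on a hypothetical imbalanced target (`y` stands in for the recession flag, with roughly 14% positives):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical stand-ins: features and an imbalanced recession flag (~14% positives)
rng = np.random.RandomState(0)
X = rng.rand(500, 3)
y = (rng.rand(500) < 0.14).astype(int)

skf = StratifiedKFold(n_splits=5)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    # Each validation fold keeps roughly the overall recession frequency,
    # but training indices can lie after validation ones (the future leaks into the past)
    print(f"fold {fold}: recession share in validation = {y[val_idx].mean():.1%}")
```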

To understand the trade-off between these two strategies, both CV-methods will be tested on some models using the training dataset. In particular, I will try to assess the impact of the ‘information leaking’ issue.

5.3 Cross-Validation — Strategies Testing

To analyze the two CV strategies, it is now necessary to group the steps together into a pipeline. This will simplify and streamline the code. As a result, the cross-validation will be applied to the following sequence:

  • Data Scaling — MinMaxScaler and QuantileTransformer-Uniform;
  • Feature Selection — SelectFromModel with RandomForest and PCA; and
  • Test Models — KNN, LogisticRegression and LinearSVC.

As mentioned before, two CV-Strategies are tested:

  • TimeSeriesSplit-Cross-Validation; and
  • StratifiedKFold-Cross-Validation.

5.3.1 Testing the CV Strategies

Below, I reproduce the main steps for testing the two CV strategies and the related code in Python.

  • First, dictionaries for data scaling, feature selection, test models and cross-validation strategies are created;
  • A multi-index DataFrame to store the results is generated; and
  • The cross-validation strategies are then tested, and results are stored in the DataFrame as they are produced (see the sketch after this list).
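
The original notebook snippets are embedded on Medium; the sketch below is a minimal reconstruction of the three steps, with illustrative names and parameters and a synthetic stand-in for the training data prepared in the earlier chapters:

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, QuantileTransformer
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.model_selection import TimeSeriesSplit, StratifiedKFold, cross_val_score

# Hypothetical stand-in for the chronologically ordered training data
rng = np.random.RandomState(0)
X_train = rng.rand(400, 20)
y_train = (rng.rand(400) < 0.14).astype(int)

# 1. Dictionaries of the building blocks to combine
scalers = {"MinMax": MinMaxScaler(),
           "Quantile-Uniform": QuantileTransformer(output_distribution="uniform")}
selectors = {"SelectFromModel": SelectFromModel(RandomForestClassifier(random_state=0)),
             "PCA": PCA(n_components=16)}
models = {"KNN": KNeighborsClassifier(),
          "LogisticRegression": LogisticRegression(max_iter=1000),
          "LinearSVC": LinearSVC(max_iter=5000)}
cv_strategies = {"TimeSeriesSplit": TimeSeriesSplit(n_splits=5),
                 "StratifiedKFold": StratifiedKFold(n_splits=5)}

# 2. Multi-index DataFrame to collect the mean CV score of every combination
index = pd.MultiIndex.from_product([list(scalers), list(selectors), list(models)],
                                   names=["scaler", "selector", "model"])
results = pd.DataFrame(index=index, columns=list(cv_strategies), dtype=float)

# 3. Run every pipeline under both CV strategies and store the mean accuracy
for sc_name, scaler in scalers.items():
    for sel_name, selector in selectors.items():
        for mdl_name, model in models.items():
            pipe = Pipeline([("scaler", scaler), ("selector", selector), ("model", model)])
            for cv_name, cv in cv_strategies.items():
                scores = cross_val_score(pipe, X_train, y_train, cv=cv, scoring="accuracy")
                results.loc[(sc_name, sel_name, mdl_name), cv_name] = scores.mean()

print(results.round(3))
```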

Results are then reported in Tab.2: based on the accuracy score, there seems to be no significant difference between the strategies used. If I based my conclusions on these data, I would be tempted to infer that data leaking from future observations is irrelevant. However, anticipating the topic of the next chapter, accuracy can be downright misleading when dealing with a dataset affected by imbalanced classes (expansions outnumber recessions by an 8:1 ratio).

Tab.2 Results using accuracy: no sign of information leaking…

To get a sense of how misleading accuracy can be in our case, the two CV strategies are now implemented using the Average Precision metric to assess the models’ performance. The results are reported in the following table:

Tab.2 Results using Average Precision: information leaking gotcha!
Tab.3 — Average Results

As the table shows, the Average Precision metric tells a completely different story: information leakage from future observations does exist! When the average data are considered (see Tab.3), models using Stratified K-Fold outperform the same models using the TimeSeriesSplit approach, regardless of the feature selection model adopted. The overall impact is about a 3% improvement in the Average Precision metric when features are selected with a Logistic Regression model and 5% when PCA is used.

5.4 Cross-Validation and Hyper-parameters

Cross-validation is very useful to evaluate how well a model generalizes the data. The next step is to improve the models’ generalization performance by tuning their hyper-parameters. Here again, using a single validation set to achieve this goal would leave the process exposed to the risk of being tied to the specific nature of that subset. Using cross-validation in the same fashion as described above provides a sound framework to fine-tune a model. As a result, when I proceed to fine-tune a model, the GridSearchCV function provided by the SciKit-Learn library will be used. GridSearchCV computes a scoring metric for an algorithm on various combinations of parameters over a cross-validation procedure. In the end, it returns the set of hyper-parameters that generalizes best over the validation sets. I will explore here a simplified example of hyper-parameter fine-tuning. The pipeline used is made up of the following steps:

  • Scaler: QuantileTransformer — Uniform;
  • Feature Selection: PCA — 16 features;
  • Model: SVC with RBF kernel; and
  • Score Metric: accuracy.

I will try to tune the model using two hyper-parameters:

  1. C: the regularization parameter; and
  2. Gamma: the RBF kernel parameter.

The Cross-Validation approach:

  • Strategy: TimeSeriesSplit
  • Number of Splits: 5 Folds.

The code to implement the analysis is the following:
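
A minimal sketch of the grid search, assuming the pipeline described above; the parameter-grid values are illustrative, and `X_train`/`y_train` are the training data (stand-ins were defined in the sketch of section 5.3):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import QuantileTransformer
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Pipeline: uniform quantile scaling, PCA with 16 components, RBF-kernel SVC
pipe = Pipeline([
    ("scaler", QuantileTransformer(output_distribution="uniform")),
    ("pca", PCA(n_components=16)),
    ("svc", SVC(kernel="rbf")),
])

# Illustrative logarithmic grid for the two hyper-parameters
param_grid = {"svc__C": [0.001, 0.01, 0.1, 1, 10, 100],
              "svc__gamma": [0.001, 0.01, 0.1, 1, 10, 100]}

# Accuracy is maximized over a 5-fold TimeSeriesSplit cross-validation
grid = GridSearchCV(pipe, param_grid,
                    scoring="accuracy",
                    cv=TimeSeriesSplit(n_splits=5))
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
print("Best mean CV accuracy:", round(grid.best_score_, 3))
```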

The analysis shows that the model delivers the best results on validation data when the parameters are set to:

  • C= 100
  • Gamma= 1.0

Since I am dealing with only two parameters, I can also represent the results in a matrix (Pic.2). It shows all the cross-validation results within the parameter grid:

Pic.2 GridSearchCV results using accuracy.
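
One way to build such a matrix from the fitted grid object (a sketch, reusing `grid` from the snippet above):

```python
import pandas as pd

# Reshape the mean test score of every C/gamma combination into a matrix
cv_results = pd.DataFrame(grid.cv_results_)
score_matrix = cv_results.pivot(index="param_svc__C",
                                columns="param_svc__gamma",
                                values="mean_test_score")
print(score_matrix.round(3))
```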

Another important feature of GridSearchCV is the possibility to define the score that the user wants to maximize. By default, the function uses accuracy. However, this metric is not ideal for every kind of problem, and different metrics can deliver different results. I will explore the evaluation-metrics problem thoroughly in the next chapter. Still, in this chapter, I will show how the GridSearchCV outcome is affected by a different evaluation metric by changing the score function to Average Precision.
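
A minimal sketch of the change, reusing the pipeline and parameter grid defined above; only the `scoring` argument differs:

```python
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Same pipeline and parameter grid as before; only the scoring metric changes
grid_ap = GridSearchCV(pipe, param_grid,
                       scoring="average_precision",
                       cv=TimeSeriesSplit(n_splits=5))
grid_ap.fit(X_train, y_train)
print("Best parameters:", grid_ap.best_params_)
```

With Average Precision as the scoring metric, the best parameters become: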

  • C= 0.001
  • Gamma= 1.0

That’s quite a change! The regularization parameter changed from 100 to 0.001. Again, I provide a graphic representation:

Pic.3 GridSearchCV results using Average Precision.

5.5 Conclusion

Information leaking from future data is a major issue that can invalidate the analysis. Moreover, with regard to the issues stemming from the imbalanced nature of the dataset, despite the uneven number of minority-class observations from fold to fold, when TimeSeriesSplit-CV is used each fold has at least some samples of the minority class. Based on these two considerations, TimeSeriesSplit-CV is the best strategy to use for evaluating the ML algorithms in this study. My analysis has shown that Stratified K-Fold is fatally flawed by information leaking and tends to overestimate the generalization properties of the models.
