Cross-Validation in Finance: Challenges and Solutions

Table of Contents

The Shortcomings of Ordinary Cross-Validation in Finance
K-Fold Cross-Validation: A Closer Look
Overcoming Challenges: Purging and Embargo
Purging
Embargo
Purged K-Fold Class in RiskLabAI
References

The Shortcomings of Ordinary Cross-Validation in Finance

In traditional settings, cross-validation is an effective tool for evaluating a machine learning model's performance. However, the complexities of financial data pose unique challenges:

Data Dependency: Financial observations are often not independent and identically distributed (IID), which violates a key assumption of standard cross-validation.
Repeated Testing: Reusing the same test set throughout model development introduces selection bias, so the reported performance overstates what the model will achieve on new data.
Data Leakage: When the training and testing sets share information, for example through labels that overlap in time, the measured accuracy is inflated relative to true out-of-sample performance.

K-Fold Cross-Validation: A Closer Look

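In k-fold cross-validation, the data is partitioned into $k$ subsets (folds). In each round, one fold is held out for validation and the model is trained on the remaining $k-1$ folds. This is repeated $k$ times so that every fold serves as the validation set exactly once, and the resulting performance metrics are averaged.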
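The partitioning described above can be sketched in a few lines of plain Python. This is a minimal illustration, not RiskLabAI code: the helper name `k_fold_indices` is hypothetical, and it assumes contiguous, unshuffled folds, which is the natural choice for time-ordered data.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for k contiguous, unshuffled folds."""
    # Fold sizes differ by at most one when n_samples is not divisible by k.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        stop = start + base + (1 if fold < extra else 0)
        test = list(range(start, stop))
        # Everything outside the test block is used for training.
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop


folds = list(k_fold_indices(n_samples=10, k=3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Note that the training indices sit immediately next to the test block on both sides. With time-dependent labels, those neighbouring observations are the ones most likely to share information with the test set.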

Overcoming Challenges: Purging and Embargo

Purging

To mitigate data leakage, one solution is "purging": removing from the training set every observation whose label overlaps in time with a label in the testing set.

Python	Julia
`def purged_train_times( data: pd.Series, test: pd.Series ) -> pd.Series:`	`function purgedTrainTimes( data::TimeArray, test::TimeArray ) ::TimeArray`
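As a rough illustration of the idea, the sketch below purges a training set using plain Python tuples. It does not reproduce the RiskLabAI functions above. The name `purge_train_intervals` and the representation of each observation as a `(start, end)` pair, where `end` is the time at which its label is determined, are assumptions made for this example.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (label start time, label end time), assumed inclusive


def purge_train_intervals(
    observations: List[Interval], test_intervals: List[Interval]
) -> List[Interval]:
    """Keep only observations whose label interval overlaps no test interval."""
    kept = []
    for start, end in observations:
        # Two closed intervals overlap when each one starts before the other ends.
        overlaps = any(
            start <= test_end and end >= test_start
            for test_start, test_end in test_intervals
        )
        if not overlaps:
            kept.append((start, end))
    return kept


# Each label spans three time steps: observation t is resolved at t + 2.
observations = [(t, t + 2) for t in range(10)]
# Test observations start at 4 and 5, so their labels cover times 4..7.
test_intervals = [(4, 7)]

train = purge_train_intervals(observations, test_intervals)
print(train)  # → [(0, 2), (1, 3), (8, 10), (9, 11)]
```

Observations starting at 2 and 3 are dropped even though they begin before the test block, because their labels are resolved inside it. The test observations themselves overlap the test interval, so they are removed as well.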

Embargo

An additional step, known as "embargo," can further reduce data leakage. It excludes from the training set the observations that immediately follow the testing set, because serially correlated features can carry test-set information into the observations that come right after it.

Python	Julia
`def embargo_times( times: pd.Series, percent_embargo: float ) -> pd.Series:`	`function embargoTimes( times::Array, percentEmbargo::Float64 )::TimeArray`
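One way to picture the embargo is as a mapping from each timestamp to the end of its embargo window. The sketch below is a simplified illustration under stated assumptions, not the library's implementation: the name `embargo_end_times` is hypothetical, and the embargo length is assumed to be `percent_embargo` expressed as a fraction of the total number of timestamps.

```python
from typing import Dict, List


def embargo_end_times(times: List[int], percent_embargo: float) -> Dict[int, int]:
    """Map each timestamp to the timestamp at which its embargo window ends."""
    # Assumption: the embargo spans a fixed number of observations,
    # computed as a fraction of the sample size.
    step = int(len(times) * percent_embargo)
    last = len(times) - 1
    # Timestamps near the end of the sample are capped at the final timestamp.
    return {t: times[min(i + step, last)] for i, t in enumerate(times)}


times = list(range(100, 110))  # ten ordered timestamps
ends = embargo_end_times(times, percent_embargo=0.2)  # two-observation embargo

# If the test set ends at time 104, the observations at 105 and 106
# would be excluded from the training set.
test_end = 104
embargoed = [t for t in times if test_end < t <= ends[test_end]]
print(embargoed)  # → [105, 106]
```

No embargo is needed before the test set in this scheme, since training observations that precede it cannot contain information produced after it; purging already handles the labels that reach into the test period.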

Purged K-Fold Class in RiskLabAI

When building a machine learning model, it's essential to avoid data leakage between the training and test sets. The Purged K-Fold class in RiskLabAI is designed for this purpose: it combines K-Fold splitting with purging and embargo. It takes the number of splits (`n_splits`), the observation times (`times`), and the size of the embargo (`percent_embargo`).

Python	Julia
`class PurgedKFold(KFold): def __init__(self, n_splits: int, times: pd.Series, percent_embargo: float ): def split(self, data: pd.DataFrame, labels: pd.Series=None, groups=None ):`	`mutable struct PurgedKFold nSplits::Int64 times::TimeArray percentEmbargo::Float64 end function purgedKFoldSplit(self::PurgedKFold, data::TimeArray )::Tuple end`
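To show how the pieces fit together, here is a self-contained sketch of a purged K-fold split with an embargo. It is not the RiskLabAI `PurgedKFold` class. The function name `purged_k_fold_split`, the `(start, end)` interval representation, and the reading of `percent_embargo` as a fraction of the sample size are all assumptions for this illustration. It also assumes the observations are sorted by start time and that `n_splits` does not exceed the number of observations.

```python
from typing import Iterator, List, Tuple

Interval = Tuple[int, int]  # (label start time, label end time)


def purged_k_fold_split(
    intervals: List[Interval], n_splits: int, percent_embargo: float = 0.0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists with purging and embargo applied."""
    n = len(intervals)
    step = int(n * percent_embargo)  # embargo length in observations (assumed)
    base, extra = divmod(n, n_splits)
    first = 0
    for fold in range(n_splits):
        last = first + base + (1 if fold < extra else 0)  # test block is [first, last)
        test = list(range(first, last))
        test_start = intervals[first][0]
        test_end = max(end for _, end in intervals[first:last])
        # Purge before the test block: keep labels resolved before it starts.
        train = [i for i in range(first) if intervals[i][1] < test_start]
        # Purge after the test block: skip observations that start
        # while a test label is still open.
        right = last
        while right < n and intervals[right][0] <= test_end:
            right += 1
        # Embargo: skip `step` further observations.
        train += list(range(min(right + step, n), n))
        yield train, test
        first = last


intervals = [(t, t + 2) for t in range(12)]
splits = list(purged_k_fold_split(intervals, n_splits=3, percent_embargo=0.1))
for train, test in splits:
    print("train:", train, "test:", test)
```

In the middle fold of this example, only three of the eight non-test observations survive as training data. Purging and embargo trade training-set size for a cleaner separation between training and testing periods, and the cost grows with the label horizon.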

These functionalities are available in both Python and Julia in the RiskLabAI library, together with a cross-validation scoring function that accepts a purged K-fold generator:

Python	Julia
`def cross_validation_score(classifier: ClassifierMixin, data: pd.DataFrame, labels: pd.Series=None, sample_weight: np.ndarray=None, scoring='neg_log_loss', times: pd.Series=None, n_splits: int=None, cross_validation_generator: BaseCrossValidator=None, percent_embargo: float=0.0 ) -> np.array:`	`function crossValidationScore(classifier, data::TimeArray, labels::TimeArray, sampleWeights::Array, scoring::String, times::TimeArray, crossValidationGenerator::PurgedKFold, nSplits::Int, percentEmbargo::Float64 )::Array`
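The scoring loop behind a function of this shape can be sketched as follows. This is an illustrative stand-in, not the library's code: `BaseRateClassifier`, `neg_log_loss`, and `cross_validation_scores` are hypothetical names, the classifier is a deliberately trivial one, and the splits are written out by hand where a purged K-fold generator would normally supply them.

```python
import math
from typing import List, Sequence, Tuple


class BaseRateClassifier:
    """Toy binary classifier: always predicts the weighted training frequency of class 1."""

    def fit(self, X, y, sample_weight):
        positive = sum(w for w, label in zip(sample_weight, y) if label == 1)
        p = positive / sum(sample_weight)
        self.p_ = min(max(p, 1e-6), 1.0 - 1e-6)  # clip so the log stays finite
        return self

    def predict_proba(self, X):
        return [self.p_ for _ in X]


def neg_log_loss(y_true, p_pred, sample_weight) -> float:
    """Weighted negative log loss for binary labels (closer to zero is better)."""
    weighted = sum(
        w * math.log(p if label == 1 else 1.0 - p)
        for label, p, w in zip(y_true, p_pred, sample_weight)
    )
    return weighted / sum(sample_weight)


def cross_validation_scores(
    classifier, X, y, sample_weight, splits: Sequence[Tuple[List[int], List[int]]]
) -> List[float]:
    """Fit on each training split and score on the matching test split."""
    scores = []
    for train, test in splits:
        classifier.fit(
            [X[i] for i in train],
            [y[i] for i in train],
            [sample_weight[i] for i in train],
        )
        p_pred = classifier.predict_proba([X[i] for i in test])
        # The same sample weights are used for fitting and for scoring.
        scores.append(
            neg_log_loss([y[i] for i in test], p_pred, [sample_weight[i] for i in test])
        )
    return scores


X = list(range(8))
y = [0, 1, 0, 1, 1, 0, 1, 0]
weights = [1.0] * 8
# Hand-written splits with a gap around each test block, as purging would leave.
splits = [([3, 4, 5, 6, 7], [0, 1]), ([0, 1, 6, 7], [3, 4])]

scores = cross_validation_scores(BaseRateClassifier(), X, y, weights, splits)
print(scores)
```

Passing the sample weights to both the fit and the score is a design choice in this sketch; it keeps the evaluation consistent with how the model was trained.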

References

De Prado, M. L. (2018). Advances in financial machine learning. John Wiley & Sons.
De Prado, M. L. (2020). Machine learning for asset managers. Cambridge University Press.