When adjusting models, we aim to increase overall model performance on unseen data. Hyperparameter tuning can lead to much better performance on test sets. However, optimizing parameters against the test set can cause information leakage, making the model perform worse on unseen data. To correct for this we can perform cross-validation (CV).

To better understand CV, we will apply different methods to the iris dataset. Let us first load and separate the data.

from sklearn import datasets

X, y = datasets.load_iris(return_X_y = True)
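
Before splitting the data, it can help to check what was loaded. The following is a minimal sketch (not part of the original lesson code) that prints the shape of the feature matrix, the number of samples per class, and whether the labels are stored grouped by class. The use of numpy here is our own addition.

```python
import numpy as np
from sklearn import datasets

X, y = datasets.load_iris(return_X_y=True)

# (number of samples, number of features)
print("Feature matrix shape:", X.shape)

# How many samples carry each label 0, 1, 2
class_counts = np.bincount(y)
print("Samples per class:", class_counts)

# True when the labels never decrease, i.e. samples are grouped by class
sorted_by_class = bool(np.all(np.diff(y) >= 0))
print("Labels grouped by class:", sorted_by_class)
```

The class ordering matters later on, because a splitter that does not shuffle will cut the data in exactly this stored order.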
There are many approaches to cross-validation. We will start by looking at k-fold cross-validation.

K-Fold

The training data used in the model is split into k smaller sets (folds). The model is trained on k-1 of the folds, and the remaining fold is used as a validation set to evaluate the model. This is repeated so that each fold serves as the validation set once.

As we will be trying to classify different species of iris flowers, we need to import a classifier model; for this exercise we will use a DecisionTreeClassifier. We also need to import the CV modules from sklearn.

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import KFold, cross_val_score

With the data loaded, we can now create a model for evaluation.

clf = DecisionTreeClassifier(random_state = 42)

Now let's evaluate our model and see how it performs on each fold.

k_folds = KFold(n_splits = 5)
scores = cross_val_score(clf, X, y, cv = k_folds)

It is also good practice to see how CV performed overall by averaging the scores for all folds.
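
To make the fold mechanics concrete, here is a rough sketch of what cross_val_score does with a KFold splitter, written as an explicit loop. It assumes the default scoring, which for a classifier is the accuracy returned by its score method; the names manual_scores, train_idx, and val_idx are ours, and this is an illustration rather than the library's actual implementation.

```python
import numpy as np
from sklearn import datasets
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import KFold, cross_val_score

X, y = datasets.load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=42)
k_folds = KFold(n_splits=5)

manual_scores = []
for train_idx, val_idx in k_folds.split(X):
    # Fresh, unfitted copy of the classifier for every fold
    model = clone(clf)
    # Train on k-1 folds ...
    model.fit(X[train_idx], y[train_idx])
    # ... and score (accuracy) on the remaining fold
    manual_scores.append(model.score(X[val_idx], y[val_idx]))
    print("train size:", len(train_idx), "validation size:", len(val_idx))

manual_scores = np.array(manual_scores)
library_scores = cross_val_score(clf, X, y, cv=k_folds)

print("Manual fold scores: ", manual_scores)
print("Average of manual scores: ", manual_scores.mean())
print("Same as cross_val_score: ", np.allclose(manual_scores, library_scores))
```

With 150 observations and 5 splits, each iteration trains on 120 observations and validates on the other 30.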

from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import KFold, cross_val_score

X, y = datasets.load_iris(return_X_y = True)
clf = DecisionTreeClassifier(random_state = 42)
k_folds = KFold(n_splits = 5)
scores = cross_val_score(clf, X, y, cv = k_folds)

print("Cross Validation Scores: ", scores)
print("Average CV Score: ", scores.mean())
print("Number of CV Scores used in Average: ", len(scores))

Stratified K-Fold

In cases where classes are imbalanced, we need a way to account for the imbalance in both the train and validation sets. To do so we can stratify on the target classes, meaning that both sets will have approximately the same proportion of each class as the full dataset.

from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = datasets.load_iris(return_X_y = True)
clf = DecisionTreeClassifier(random_state = 42)
sk_folds = StratifiedKFold(n_splits = 5)
scores = cross_val_score(clf, X, y, cv = sk_folds)

print("Cross Validation Scores: ", scores)
print("Average CV Score: ", scores.mean())
print("Number of CV Scores used in Average: ", len(scores))

While the number of folds is the same, the average CV score increases over basic k-fold when the folds are stratified by class.
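
One likely reason for the difference is how the two splitters cut this particular dataset. The iris labels are stored grouped by class, so an unshuffled KFold can produce validation folds that contain only one or two classes, while StratifiedKFold keeps all three classes in every fold. The sketch below counts the class labels in each validation fold; the helper validation_class_counts is a hypothetical name introduced for this illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import KFold, StratifiedKFold

X, y = datasets.load_iris(return_X_y=True)

def validation_class_counts(splitter):
    """Return, for each fold, how many validation samples belong to class 0, 1, 2."""
    return [np.bincount(y[val_idx], minlength=3).tolist()
            for _, val_idx in splitter.split(X, y)]

kfold_counts = validation_class_counts(KFold(n_splits=5))
strat_counts = validation_class_counts(StratifiedKFold(n_splits=5))

print("KFold validation class counts per fold:")
for counts in kfold_counts:
    print("  ", counts)

print("StratifiedKFold validation class counts per fold:")
for counts in strat_counts:
    print("  ", counts)
```

Passing shuffle = True to KFold is another way to avoid folds made of a single class, although it does not guarantee equal class proportions the way stratification does.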

Leave-One-Out (LOO)

Instead of selecting the number of splits in the training data set as in k-fold, LeaveOneOut uses 1 observation to validate and n-1 observations to train. This method is an exhaustive technique.

from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = datasets.load_iris(return_X_y = True)
clf = DecisionTreeClassifier(random_state = 42)
loo = LeaveOneOut()
scores = cross_val_score(clf, X, y, cv = loo)

print("Cross Validation Scores: ", scores)
print("Average CV Score: ", scores.mean())
print("Number of CV Scores used in Average: ", len(scores))

We can observe that the number of cross-validation scores computed is equal to the number of observations in the dataset. In this case there are 150 observations in the iris dataset. The average CV score is 94%.
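
Because every validation set holds a single observation, each individual score can only be 0 (misclassified) or 1 (classified correctly), so the average is simply the fraction of observations predicted correctly. A small sketch to check this, with variable names of our own choosing:

```python
import numpy as np
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = datasets.load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=42)
loo = LeaveOneOut()

# One split per observation
n_splits = loo.get_n_splits(X)
scores = cross_val_score(clf, X, y, cv=loo)

# Each score is the accuracy on a single held-out observation
distinct_scores = np.unique(scores)
n_correct = int(scores.sum())

print("Number of splits: ", n_splits)
print("Distinct fold scores: ", distinct_scores)
print("Correct predictions: ", n_correct, "of", len(scores))
print("Average CV Score: ", n_correct / len(scores))
```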

Leave-P-Out (LPO)

Leave-P-Out is a slight variation on the Leave-One-Out idea: we choose the number of observations, p, to hold out in each validation set.

from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import LeavePOut, cross_val_score

X, y = datasets.load_iris(return_X_y = True)
clf = DecisionTreeClassifier(random_state = 42)
lpo = LeavePOut(p = 2)
scores = cross_val_score(clf, X, y, cv = lpo)

print("Cross Validation Scores: ", scores)
print("Average CV Score: ", scores.mean())
print("Number of CV Scores used in Average: ", len(scores))
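
The number of scores grows quickly with p, because Leave-P-Out evaluates every possible group of p observations, which gives "n choose p" splits. The sketch below compares the splitter's own count with that formula for a few values of p; the values other than p = 2 are our own choice for illustration.

```python
from math import comb
from sklearn import datasets
from sklearn.model_selection import LeavePOut

X, y = datasets.load_iris(return_X_y=True)
n_samples = len(X)

split_counts = {}
for p in (1, 2, 3):
    # Number of train/validation splits the splitter would generate
    split_counts[p] = LeavePOut(p=p).get_n_splits(X)
    print("p =", p, "->", split_counts[p], "splits; n choose p =", comb(n_samples, p))
```

With p = 1 this reduces to the 150 splits of Leave-One-Out, while p = 2 already requires 11175 model fits.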

As we can see, this is an exhaustive method, with many more scores being calculated than in Leave-One-Out, even with p = 2, yet it achieves roughly the same average CV score.

Shuffle Split

Unlike KFold, ShuffleSplit leaves out a percentage of the data, not to be used in the train or validation sets. To do so we must decide what the train and test sizes are, as well as the number of splits.
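
To see what leaving out a percentage means in practice, the sketch below counts how many observations land in the train set, the validation set, and neither, for each split. It uses a 60% train size and a 30% test size; random_state = 0 is an assumption added only so the shuffling is repeatable, and split_sizes is a name of our own.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import ShuffleSplit

X, y = datasets.load_iris(return_X_y=True)

# random_state is our own addition, used only for reproducible shuffles
ss = ShuffleSplit(train_size=0.6, test_size=0.3, n_splits=5, random_state=0)

split_sizes = []
for train_idx, val_idx in ss.split(X):
    # Observations that appear in either the train or the validation set
    used = np.union1d(train_idx, val_idx)
    n_unused = len(X) - len(used)
    split_sizes.append((len(train_idx), len(val_idx), n_unused))
    print("train:", len(train_idx), "validation:", len(val_idx), "left out:", n_unused)
```

Since the two sizes add up to 90% of the data, the remaining 10% of the observations is not used at all in a given split, and a different 10% is left out each time.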

from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = datasets.load_iris(return_X_y = True)
clf = DecisionTreeClassifier(random_state = 42)
ss = ShuffleSplit(train_size = 0.6, test_size = 0.3, n_splits = 5)
scores = cross_val_score(clf, X, y, cv = ss)

print("Cross Validation Scores: ", scores)
print("Average CV Score: ", scores.mean())
print("Number of CV Scores used in Average: ", len(scores))

These are just a few of the CV methods that can be applied to models. There are many more cross-validation classes, with most models having their own class. Check out sklearn's cross-validation documentation for more CV options.