What is repeated k-fold cross-validation?
In our previous article, we discussed what k-fold cross-validation is, how it works, and how to perform it using the sklearn Python library. In repeated k-fold cross-validation, the k-fold cross-validation procedure is repeated a certain number of times, each repetition using a different randomization of the data. The algorithm estimates the performance of the model in each repetition. And finally, we take the average of all the estimates.
A single run of the k-fold cross-validation algorithm may provide a noisy estimate of model performance. Because repeated k-fold cross-validation uses a different randomization, and therefore produces a different result, in each repetition, averaging over the repetitions gives a more stable estimate of the model's performance.
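To see why repetition helps, we can compare several single runs of 10-fold cross-validation, each with a different shuffle, against the repeated version. The following is a minimal sketch; it uses sklearn's built-in copy of the iris dataset (rather than seaborn's) to keep the example self-contained.

```python
# Sketch: single k-fold estimates vary with the shuffle; repeated k-fold
# averages over many such runs. Uses sklearn's built-in iris dataset.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import KFold, RepeatedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier()

# Each single 10-fold run uses a different shuffle, so its estimate varies.
single_estimates = [
    cross_val_score(
        model, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=s)
    ).mean()
    for s in range(5)
]
print(single_estimates)

# Repeated k-fold performs several such runs and averages all fold scores.
cv = RepeatedKFold(n_splits=10, n_repeats=5, random_state=1)
repeated_estimate = cross_val_score(model, X, y, cv=cv).mean()
print(repeated_estimate)
```

Printing the five single-run estimates shows they differ from shuffle to shuffle, which is exactly the noise that averaging over repetitions smooths out.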
Repeated K-Fold Cross-Validation using Python sklearn
We can use the following Python code to implement repeated k-fold cross-validation.
import seaborn
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import cross_val_score

dataset = seaborn.load_dataset("iris")
D = dataset.values
X = D[:, :-1]
y = D[:, -1]

model = KNeighborsClassifier()
cv = RepeatedKFold(n_splits=10, n_repeats=5, random_state=1)
results = cross_val_score(model, X, y, scoring="accuracy", cv=cv)
print("Accuracy: ", results.mean())
Here, we read the iris dataset using the seaborn library. The dataset contains the sepal length, sepal width, petal length, and petal width of iris flowers. A machine learning model can predict the species of a flower based on these features. The last column of the dataset contains the target variable (the species).
dataset = seaborn.load_dataset("iris")
D = dataset.values
X = D[:, :-1]
y = D[:, -1]
Now, we split the columns of the dataset into features and the target variable. Since the last column contains the target variable, X holds all the feature columns and y holds the target variable.
model = KNeighborsClassifier()
cv = RepeatedKFold(n_splits=10, n_repeats=5, random_state=1)
results = cross_val_score(model, X, y, scoring="accuracy", cv=cv)
We are using the K-Nearest Neighbors classifier for this problem. We initialize the model with the KNeighborsClassifier class.
Then, we initialize repeated k-fold cross-validation. The argument n_splits refers to the number of folds (splits) in each repetition of the k-fold cross-validation. And n_repeats specifies that the k-fold cross-validation is repeated 5 times. The random_state argument seeds the pseudo-random number generator used to shuffle the data in each repetition.
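With these arguments, the cross-validator yields n_splits × n_repeats train/test splits in total. A quick way to verify this is a minimal sketch on a toy dataset (independent of the iris data used in the article):

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

X = np.arange(20).reshape(10, 2)  # a toy dataset with 10 samples

cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=1)
print(cv.get_n_splits(X))           # 15 splits in total: 5 folds x 3 repetitions
print(sum(1 for _ in cv.split(X)))  # iterating over the splits confirms the count
```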
Finally, we use the cross_val_score() function to estimate the performance of the model. Here, we are using the accuracy score as the evaluation metric (What is the accuracy score in machine learning?). And we print the average accuracy score across all the splits.
The output of the given program will look like the following:
Accuracy: 0.9640000000000001
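Note that cross_val_score() returns one score per split, not a single number; with n_splits=10 and n_repeats=5 that is 50 scores. Reporting the standard deviation alongside the mean is a common way to show how noisy the estimate is. The following sketch uses sklearn's built-in copy of the iris dataset, so its numbers may differ slightly from the seaborn-based program above.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Using sklearn's built-in copy of the iris dataset for self-containment.
X, y = load_iris(return_X_y=True)

model = KNeighborsClassifier()
cv = RepeatedKFold(n_splits=10, n_repeats=5, random_state=1)
results = cross_val_score(model, X, y, scoring="accuracy", cv=cv)

print(len(results))  # 50 scores: one per fold per repetition
print("Accuracy: %.3f (+/- %.3f)" % (results.mean(), results.std()))
```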