What is repeated stratified k-fold cross-validation?
In our previous articles, we discussed k-fold cross-validation, repeated k-fold cross-validation, and stratified k-fold cross-validation. Stratified k-fold cross-validation returns stratified folds: the dataset is split into k folds in such a way that each fold contains approximately the same proportion of each target class as the complete dataset.
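To make the idea of stratified folds concrete, here is a minimal sketch using StratifiedKFold from sklearn. The toy labels (80 negatives and 20 positives) and the placeholder feature matrix are assumptions made only for illustration. The sketch compares the share of positives in each test fold with the share in the whole dataset.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels (an assumption for illustration): 80 negatives, 20 positives.
y = np.array([0] * 80 + [1] * 20)
# Placeholder features: the values do not affect how the folds are formed.
X = np.zeros((len(y), 1))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Fraction of positive samples in each test fold.
fold_ratios = [float(y[test_idx].mean()) for _, test_idx in skf.split(X, y)]

print("Overall positive ratio:", y.mean())
print("Per-fold positive ratios:", fold_ratios)
```

Because the class counts here divide evenly by the number of folds, every fold can match the overall ratio exactly. With real data the match is only approximate.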
In repeated stratified k-fold cross-validation, stratified k-fold cross-validation is repeated a specified number of times, each time with a different randomization of the data. Each repetition therefore produces different folds and slightly different results, and we can take the average of the results over all repetitions. Because this average does not depend on a single random split, it usually gives a less noisy estimate of a model's performance than a single run of stratified k-fold cross-validation.
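The repetition can be sketched with RepeatedStratifiedKFold on a small toy label vector (the labels and the placeholder features are assumptions for illustration). The sketch lists all the splits, groups them by repetition, and checks that within each repetition every sample lands in exactly one test fold.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Toy labels (hypothetical): 12 samples of class 0 and 8 of class 1.
y = np.array([0] * 12 + [1] * 8)
X = np.zeros((len(y), 1))  # placeholder features

n_splits, n_repeats = 4, 3
rskf = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=0)
splits = list(rskf.split(X, y))  # list of (train_indices, test_indices) pairs

covers_all = []
for rep in range(n_repeats):
    # The splits come out repetition by repetition, n_splits at a time.
    folds = splits[rep * n_splits:(rep + 1) * n_splits]
    test_indices = np.concatenate([test_idx for _, test_idx in folds])
    # Within one repetition, the test folds together cover each sample once.
    covers_all.append(sorted(test_indices.tolist()) == list(range(len(y))))
    print("Repetition", rep, "first test fold:", folds[0][1])

print("Total number of splits:", len(splits))
```

The total number of splits is n_splits multiplied by n_repeats, which is also the number of scores that an evaluation over these splits would produce.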
Repeated Stratified K-Fold Cross-Validation using sklearn in Python
We can use the following Python code to implement repeated stratified k-fold cross-validation.
import pandas
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

dataset = pandas.read_csv("diabetes.csv")
D = dataset.values
X = D[:, :-1]  # features: every column except the last
y = D[:, -1]   # target: the last column

model = LogisticRegression(solver="liblinear")
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=1)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print("Accuracy:", scores.mean())
Here, we first read the Pima Indians Diabetes dataset using the pandas Python library. The dataset contains information such as plasma glucose concentration, blood pressure, and serum insulin. Based on these features, a machine learning model can predict whether a patient has diabetes.
dataset = pandas.read_csv("diabetes.csv")
D = dataset.values
X = D[:, :-1]
y = D[:, -1]
The last column of the dataset contains the target variable. So, X holds every column except the last (the features), and y holds the last column (the target variable).
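As a quick check of this feature and target split, here is a minimal sketch on a tiny made-up array. The values are hypothetical and are not taken from diabetes.csv. It only illustrates that the slice [:, :-1] keeps every column except the last, while [:, -1] keeps the last column.

```python
import numpy as np

# Hypothetical 3-row dataset: two feature columns followed by a 0/1 target column.
D = np.array([
    [148.0, 72.0, 1.0],
    [85.0, 66.0, 0.0],
    [183.0, 64.0, 1.0],
])

X = D[:, :-1]  # all rows, every column except the last
y = D[:, -1]   # all rows, last column only

print("X shape:", X.shape)
print("y:", y)
```

Note the comma in both slices: the part before it selects rows, and the part after it selects columns.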
model = LogisticRegression(solver="liblinear")
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=1)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
Now, we initialize the model. We are using logistic regression to solve this problem. Then, we initialize repeated stratified k-fold cross-validation. Here, n_splits is the number of folds, and n_repeats is the number of times the stratified k-fold cross-validation is repeated. The random_state argument seeds the pseudo-random number generator used for the randomization, so the same value reproduces the same splits.
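The effect of random_state can be checked with a short sketch. The toy labels and the helper collect_test_folds are hypothetical and exist only for this illustration. The sketch builds the cross-validator twice with the same seed and compares the resulting test folds.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Toy labels (hypothetical): 30 samples of class 0 and 20 of class 1.
y = np.array([0] * 30 + [1] * 20)
X = np.zeros((len(y), 1))  # placeholder features

def collect_test_folds(seed):
    """Return the test indices of every split as a list of lists."""
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=seed)
    return [test_idx.tolist() for _, test_idx in cv.split(X, y)]

folds_first_run = collect_test_folds(1)
folds_second_run = collect_test_folds(1)
same_seed_matches = folds_first_run == folds_second_run

print("Number of splits:", len(folds_first_run))
print("Same seed gives identical splits:", same_seed_matches)
```

Fixing the seed in this way makes a reported score reproducible, which is presumably why the program above passes random_state=1.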
Now, we use the cross_val_score() function to estimate the performance of the model, with accuracy as the metric. The function returns one accuracy score for every fold of every repetition, so here we get n_splits × n_repeats = 10 × 5 = 50 scores. We print the average of these scores.
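To look at the individual scores rather than only their average, the following sketch may help. It uses synthetic data from make_classification as a stand-in for diabetes.csv, which is an assumption, so its numbers say nothing about the diabetes dataset. It groups the scores by repetition and also reports their spread.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic binary classification data (a stand-in for diabetes.csv).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

n_splits, n_repeats = 10, 5
model = LogisticRegression(solver="liblinear")
cv = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=1)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

# Scores arrive repetition by repetition, so each row below is one repetition.
per_repeat_mean = scores.reshape(n_repeats, n_splits).mean(axis=1)

print("Number of scores:", scores.size)
print("Mean accuracy per repetition:", per_repeat_mean)
print("Overall mean:", scores.mean())
print("Standard deviation:", scores.std())
```

Reporting the standard deviation next to the mean gives a rough sense of how much the estimate moves from one split to another.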
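The program prints the mean accuracy. In our run, the output was as follows; the exact value can vary with the copy of diabetes.csv and the scikit-learn version used: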
Accuracy: 0.6466302118933698





