the target variable. Please note that the last column of the dataset contains the target variable. So, X here contains all the features, and y contains the target variable.
data = pandas.read_csv("diabetes.csv")
D = data.values
X = D[:, :-1]
y = D[:, -1]
Now, we are creating a list of tuples. Each tuple contains the name of an estimator and the estimator. One of the estimators is the standard scaler, and the other is a logistic regressor. We will use Logistic Regression to solve this classification problem.
estimators = list()
estimator1 = StandardScaler()
estimators.append(("Standard Scaler", estimator1))
estimator2 = LogisticRegression(solver="liblinear")
estimators.append(("Logistic Regression Classifier", estimator2))
model = Pipeline(estimators)
After that, we are creating a pipeline with the estimators.
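To see what the pipeline actually does when fitted, here is a minimal sketch using small synthetic data in place of the diabetes dataset (the variable names X_demo and y_demo are made up for this illustration). Fitting the pipeline fits the scaler on the data, transforms the data, and then fits the classifier on the transformed data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the diabetes features and target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

pipe = Pipeline([
    ("Standard Scaler", StandardScaler()),
    ("Logistic Regression Classifier", LogisticRegression(solver="liblinear")),
])
pipe.fit(X_demo, y_demo)  # fits the scaler, transforms X_demo, then fits the classifier

# The fitted scaler is accessible by the name we gave it in the tuple
print(pipe.named_steps["Standard Scaler"].mean_)  # per-feature means learned by the scaler
print(pipe.predict(X_demo[:5]))  # predictions pass through the scaler, then the classifier
```

Calling predict() on the pipeline automatically applies the same scaling before the classifier sees the data, which is exactly why the pipeline is convenient inside cross-validation.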
k_fold = KFold(n_splits=10, shuffle=True, random_state=1)
Now, we are initializing the k-fold cross-validation. n_splits here refers to the number of splits. The argument shuffle=True indicates that we are shuffling the data before splitting. And the argument random_state is used to initialize the pseudo-random number generator that is used for shuffling the data.
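A quick way to see what KFold produces is to print the train and test indices it generates for a tiny toy array (the names X_toy and the 5 splits here are illustrative only; the program above uses 10 splits on the real dataset):

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
k_fold = KFold(n_splits=5, shuffle=True, random_state=1)

# Each iteration yields index arrays for the training and test portions
for i, (train_idx, test_idx) in enumerate(k_fold.split(X_toy)):
    print(f"Fold {i}: train={train_idx}, test={test_idx}")
```

Across all folds, every sample appears in the test set exactly once, which is what makes the averaged score an honest estimate of generalization performance.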
result = cross_val_score(model, X, y, scoring="accuracy", cv=k_fold)
print("Accuracy: ", result.mean())
Now, we are using the cross_val_score() function to estimate the performance of the model. Within each fold, the standard scaler first standardizes the data, and then the logistic regression classifier is fitted on the standardized data. We calculate the accuracy score for each iteration of the k-fold cross-validation, and then we take the mean of all the accuracy scores.
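Standardization rescales each feature to zero mean and unit variance using z = (x − μ) / σ, where μ and σ are the feature's mean and standard deviation. A short sketch with made-up data (X_demo is hypothetical) shows that StandardScaler computes exactly this:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales
X_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 30.0]])

scaler = StandardScaler()
Z = scaler.fit_transform(X_demo)

# Manual standardization: subtract the column mean, divide by the column std
Z_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(np.allclose(Z, Z_manual))   # the two results match
print(Z.mean(axis=0))             # each column now has mean ~0
print(Z.std(axis=0))              # each column now has std ~1
```

This is why standardization matters here: without it, features with larger numeric ranges would dominate the logistic regression fit.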
The above program will print the following output:
Accuracy: 0.7733424470266576