Data leakage is a common concern when solving a problem with machine learning. We typically split the whole dataset into a training set and a test set, and we need to be careful that no information leaks from the test set into the training set. For example, if we standardize the data, the training set should not be influenced by the scale of the data in the test set.
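To make the standardization example concrete, here is a minimal sketch that contrasts a leaky scaler with a leak-free one. The data is synthetic and the variable names (`leaky_scaler`, `safe_scaler`) are assumptions chosen for illustration; they are not part of the original example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for illustration, not a real dataset).
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

# Leaky: the scaler sees every row, including the rows held out for testing.
leaky_scaler = StandardScaler().fit(X)

# Leak-free: the scaler learns its mean and standard deviation
# from the training rows only.
safe_scaler = StandardScaler().fit(X_train)
X_train_scaled = safe_scaler.transform(X_train)
X_test_scaled = safe_scaler.transform(X_test)  # reuses the training statistics

print("Mean learned from all rows:     ", leaky_scaler.mean_)
print("Mean learned from training rows:", safe_scaler.mean_)
```

The two printed means differ slightly, and that difference is exactly the information the test rows would otherwise contribute to training.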
We can use a pipeline to prevent such data leakage. For example, let’s say we are working with the Pima Indians Diabetes dataset. The dataset contains various predictor variables, such as the number of pregnancies the patient has had, BMI, insulin level, and age. A machine learning model can learn from the dataset and predict whether a patient has diabetes based on these predictor variables. Now, let’s say we want to standardize the data before a machine learning algorithm runs on it.
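As a quick illustration of what standardizing means here, the sketch below rescales each column to zero mean and unit variance by hand and compares the result with `StandardScaler`. The numbers are hypothetical values for two predictors and are not taken from the actual dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values for two predictors (say, BMI and age); not real data.
X = np.array([
    [22.0, 25.0],
    [30.5, 41.0],
    [27.3, 33.0],
    [35.1, 58.0],
])

# Manual standardization: subtract the column mean,
# then divide by the column standard deviation.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

# StandardScaler computes the same quantities
# (it uses the population standard deviation, ddof=0).
scaled = StandardScaler().fit_transform(X)

print(np.allclose(manual, scaled))
```

Because the mean and standard deviation are learned from whatever rows the scaler is fitted on, it matters which rows those are.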
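We can use k-fold cross-validation along with a pipeline so that the training set is not influenced by the scale of the data in the test set. In each fold, the pipeline fits the scaler on the training portion only and then applies the same transformation to the held-out portion. We can use the following Python code for that purpose: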
```python
import pandas
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Read the dataset; the last column is the target, the rest are features.
data = pandas.read_csv("diabetes.csv")
D = data.values
X = D[:, :-1]
y = D[:, -1]

# Build the pipeline: standardize first, then classify.
estimators = list()
estimator1 = StandardScaler()
estimators.append(("Standard Scaler", estimator1))
estimator2 = LogisticRegression(solver="liblinear")
estimators.append(("Logistic Regression Classifier", estimator2))
model = Pipeline(estimators)

# Evaluate the whole pipeline with 10-fold cross-validation.
k_fold = KFold(n_splits=10, shuffle=True, random_state=1)
result = cross_val_score(model, X, y, scoring="accuracy", cv=k_fold)
print("Accuracy: ", result.mean())
```
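To see why the pipeline prevents leakage, it can help to write out by hand roughly what `cross_val_score` does with it. The sketch below is an approximation of that loop, not the library's actual implementation. It uses synthetic data from `make_classification` as a stand-in for `diabetes.csv` so that it runs anywhere, and the step names and `manual_scores` variable are assumptions for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the CSV file (an assumption so the sketch is runnable).
X, y = make_classification(n_samples=300, n_features=8, random_state=1)

model = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(solver="liblinear")),
])
k_fold = KFold(n_splits=10, shuffle=True, random_state=1)

manual_scores = []
for train_idx, test_idx in k_fold.split(X):
    fold_model = clone(model)  # fresh, unfitted copy for each fold
    # fit() fits the scaler and the classifier on the training fold only.
    fold_model.fit(X[train_idx], y[train_idx])
    # score() transforms the held-out fold with the training-fold
    # statistics, then reports accuracy.
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))

library_scores = cross_val_score(model, X, y, scoring="accuracy", cv=k_fold)

print("Manual loop:    ", np.mean(manual_scores))
print("cross_val_score:", library_scores.mean())
```

In this setup, the two averages should agree, which shows that the scaler is refitted inside every fold and never sees the held-out rows.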
Here, we first read the Pima Indians Diabetes dataset and then split the columns of the dataset into features and …