df_features = df.drop(labels=["Outcome"], axis=1)
df_target = df.filter(items=["Outcome"])
Next, we split the dataset into training and test sets. The test set holds 20% of the rows, and the data is shuffled before the split. The random_state=1 parameter fixes the seed of the random number generator used for shuffling, so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(df_features, df_target["Outcome"], test_size=0.2, shuffle=True, random_state=1)
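To see the 80/20 split in action, here is a minimal sketch on a small synthetic frame (the column name `feature` and the 100-row frame are illustrative only, not the diabetes dataset):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# A toy frame with 100 rows: one feature column and a binary Outcome.
demo = pd.DataFrame({"feature": range(100), "Outcome": [0, 1] * 50})

# Same arguments as in the article: 20% test set, shuffled, fixed seed.
X_tr, X_te, y_tr, y_te = train_test_split(
    demo[["feature"]], demo["Outcome"],
    test_size=0.2, shuffle=True, random_state=1,
)

print(len(X_tr), len(X_te))  # 80 rows for training, 20 for testing
```

Because random_state is fixed, rerunning this snippet always produces the same 80/20 partition.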
Next, we initialize the classifier. The DecisionTreeClassifier() constructor takes random_state=1 to control the randomness of the estimator; scikit-learn randomly permutes the features at each split, so fixing the seed makes the resulting tree reproducible.
The fit() method learns the tree from the training set, and the predict() method predicts the target label for each row of the test set.
classifier = DecisionTreeClassifier(random_state=1)
classifier.fit(X_train, y_train)
y_test_pred = classifier.predict(X_test)
Now, we can compare y_test_pred with y_test to measure the performance of the model. Here, we will use the accuracy score: the fraction of test samples whose predicted label matches the true label.
accuracy = accuracy_score(y_test, y_test_pred)
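To make the metric concrete, here is a small sketch showing that accuracy_score is just the fraction of correct predictions (the label lists below are made up for illustration):

```python
from sklearn.metrics import accuracy_score

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]  # one mistake out of five predictions

# Manual computation: correct predictions divided by total predictions.
manual = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(manual, accuracy_score(y_true, y_pred))  # both print 0.8
```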
The output of the above program will be like the following:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Pregnancies               768 non-null    int64
 1   Glucose                   768 non-null    int64
 2   BloodPressure             768 non-null    int64
 3   SkinThickness             768 non-null    int64
 4   Insulin                   768 non-null    int64
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64
 8   Outcome                   768 non-null    int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
None
   Pregnancies  Glucose  BloodPressure  ...  DiabetesPedigreeFunction  Age  Outcome
0            6      148             72  ...                     0.627   50        1
1            1       85             66  ...                     0.351   31        0
2            8      183             64  ...                     0.672   32        1
3            1       89             66  ...                     0.167   21        0
4            0      137             40  ...                     2.288   33        1

[5 rows x 9 columns]
   Pregnancies  Glucose  BloodPressure  ...   BMI  DiabetesPedigreeFunction  Age
0            6      148             72  ...  33.6                     0.627   50
1            1       85             66  ...  26.6                     0.351   31
2            8      183             64  ...  23.3                     0.672   32
3            1       89             66  ...  28.1                     0.167   21
4            0      137             40  ...  43.1                     2.288   33

[5 rows x 8 columns]
   Outcome
0        1
1        0
2        1
3        0
4        1
Accuracy Score: 0.6948051948051948
As we can see, an accuracy of roughly 69% is decent but not great. In our next article, we will use a random forest classifier to improve this score.