df = seaborn.load_dataset("iris")
df_features = df.drop(labels=["species"], axis=1)
df_target = df.filter(items=["species"])
Note that the species column contains strings, whereas machine learning tools generally work with numbers. So, we label-encode the species column in df_target, which replaces each species name with an integer.
encoder = LabelEncoder()
df_target["species"] = encoder.fit_transform(df_target["species"])
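To make the encoding step concrete, here is a minimal pure-Python sketch of the idea behind label encoding: the distinct labels are sorted and each one is replaced by its position in that sorted list. The helper name label_encode and the sample list are hypothetical and used only for illustration; this is not scikit-learn's implementation.

```python
def label_encode(values):
    """Map each distinct label to an integer, with classes taken in sorted order."""
    classes = sorted(set(values))
    index = {label: i for i, label in enumerate(classes)}
    return [index[v] for v in values], classes


# Hypothetical sample of species names, not the full iris column.
species = ["setosa", "virginica", "versicolor", "setosa"]
codes, classes = label_encode(species)
print(classes)  # ['setosa', 'versicolor', 'virginica']
print(codes)    # [0, 2, 1, 0]
```

The integers carry no ordering between species; they are only identifiers.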
Now, we will perform k-means clustering on the dataset, dividing the data into three clusters. The random_state=1 parameter seeds the random number generator used to initialize the three cluster centroids, so the result is the same on every run (How does k-means clustering work?)
The fit_predict() method learns from the dataset, performs the clustering, and then assigns a label to each data point based on the cluster it belongs to. So, y_pred contains the predicted cluster labels.
kmeans = KMeans(n_clusters=3, random_state=1)
y_pred = kmeans.fit_predict(df_features)
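The fit-then-assign behaviour described above can be sketched in plain Python. This is a simplified illustration and not scikit-learn's implementation: the function name kmeans_fit_predict, the n_iter parameter, and the toy points are hypothetical, and the centroids start at randomly chosen data points, whereas KMeans uses the k-means++ initialization by default.

```python
import random


def kmeans_fit_predict(points, n_clusters, seed, n_iter=100):
    """Tiny Lloyd's-algorithm sketch: returns one cluster index per point."""
    rng = random.Random(seed)
    # Assumption: centroids start at randomly chosen data points.
    centroids = rng.sample(points, n_clusters)
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(n_clusters),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centroids[k])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the loop has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in range(n_clusters):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


# Hypothetical toy data: two loose groups of 2-D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels_a = kmeans_fit_predict(points, n_clusters=2, seed=1)
labels_b = kmeans_fit_predict(points, n_clusters=2, seed=1)
print(labels_a == labels_b)  # True: the same seed gives the same labels
```

Fixing the seed plays the same role as random_state=1: the starting centroids, and therefore the final labels, are reproducible.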
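Now, we can compare y_pred with df_target["species"] and measure the clustering performance using the homogeneity score. The score ranges from 0 to 1, and a score of 1 means that every cluster contains data points from only one species.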
score = homogeneity_score(df_target["species"], y_pred)
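As a rough illustration of what the score measures, the sketch below computes homogeneity from scratch using the usual definition h = 1 - H(C|K) / H(C), where H(C) is the entropy of the true classes and H(C|K) is the entropy of the classes given the cluster assignments. The function name homogeneity and the toy label lists are hypothetical; in practice you would call homogeneity_score as shown above.

```python
from collections import Counter
from math import log


def homogeneity(labels_true, labels_pred):
    """Homogeneity h = 1 - H(C|K) / H(C), computed from label counts."""
    n = len(labels_true)
    class_counts = Counter(labels_true)
    cluster_counts = Counter(labels_pred)
    joint_counts = Counter(zip(labels_true, labels_pred))
    # Entropy of the true classes, H(C).
    h_c = -sum(c / n * log(c / n) for c in class_counts.values())
    if h_c == 0:
        return 1.0  # a single class is trivially homogeneous
    # Conditional entropy of the classes given the clusters, H(C|K).
    h_c_given_k = -sum(
        n_ck / n * log(n_ck / cluster_counts[k])
        for (c, k), n_ck in joint_counts.items()
    )
    return 1.0 - h_c_given_k / h_c


# Hypothetical toy labels: three classes with two points each.
truth = [0, 0, 1, 1, 2, 2]
renamed = homogeneity(truth, [2, 2, 0, 0, 1, 1])  # same grouping, different ids
lumped = homogeneity(truth, [0, 0, 0, 0, 0, 0])   # everything in one cluster
mixed = homogeneity(truth, [0, 0, 0, 1, 1, 1])    # clusters mix two classes
print(renamed, lumped, mixed)
```

The first case scores 1 even though the cluster ids differ from the class ids. This matters here because the cluster numbers returned by k-means are arbitrary and need not match the encoded species numbers.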
The output of the above program will be:
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
Homogeneity Score: 0.7514854021988339