What is the homogeneity score in machine learning?
The homogeneity score is a metric for measuring clustering performance in machine learning when the true class labels are known. A clustering result is said to be homogeneous if each cluster contains only data points that are members of a single class.
A homogeneity score is a number between 0 and 1. A low value indicates low homogeneity and a high value indicates high homogeneity. A homogeneity score of 1 indicates perfectly homogeneous labeling.
Please note that the homogeneity score is independent of the absolute values of the labels. If we permute the labels, the homogeneity score will remain the same.
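To make these properties concrete, here is a minimal pure-Python sketch that computes homogeneity from its usual entropy-based definition, h = 1 - H(C|K) / H(C), where C denotes the true classes and K the predicted clusters. The helper name homogeneity and the toy label lists are assumptions made for illustration. This is not scikit-learn's implementation, although it should agree with homogeneity_score on these inputs.

```python
import math
from collections import Counter


def homogeneity(labels_true, labels_pred):
    """Hypothetical helper: h = 1 - H(C|K) / H(C) for classes C and clusters K."""
    n = len(labels_true)
    class_counts = Counter(labels_true)
    cluster_counts = Counter(labels_pred)
    joint_counts = Counter(zip(labels_true, labels_pred))

    # H(C): entropy of the true class distribution.
    h_c = -sum((c / n) * math.log(c / n) for c in class_counts.values())
    if h_c == 0.0:
        # Only one class exists, so every cluster is trivially homogeneous.
        return 1.0

    # H(C|K): class entropy that remains once the cluster is known.
    h_c_given_k = -sum(
        (n_ck / n) * math.log(n_ck / cluster_counts[k])
        for (_, k), n_ck in joint_counts.items()
    )
    return 1.0 - h_c_given_k / h_c


# Toy ground truth: two classes with three points each (illustrative data).
y_true = [0, 0, 0, 1, 1, 1]

perfect = homogeneity(y_true, [0, 0, 0, 1, 1, 1])   # each cluster holds one class
permuted = homogeneity(y_true, [7, 7, 7, 2, 2, 2])  # same grouping, renamed labels
mixed = homogeneity(y_true, [0, 0, 1, 1, 0, 1])     # both clusters mix the classes
merged = homogeneity(y_true, [0, 0, 0, 0, 0, 0])    # one cluster holds everything

print(perfect, permuted, mixed, merged)
```

The perfect and permuted labelings both score 1, because only the grouping matters and not the label values. The mixed labeling falls strictly between 0 and 1, and merging everything into a single cluster drives the score to 0.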
How to measure clustering performance using homogeneity score in sklearn?
Let’s read the iris dataset. It contains four features (sepal length, sepal width, petal length, and petal width) from which we can determine the species of a flower. Let’s perform k-means clustering on the dataset and measure the clustering performance using the homogeneity score. We can use the following Python code for that purpose:
from sklearn.cluster import KMeans
from sklearn.metrics import homogeneity_score
import seaborn
from sklearn.preprocessing import LabelEncoder

# Load the iris dataset and separate the features from the target column
df = seaborn.load_dataset("iris")
df_features = df.drop(labels=["species"], axis=1)
df_target = df.filter(items=["species"])
print(df.head())

# Encode the species names as integer labels
encoder = LabelEncoder()
df_target["species"] = encoder.fit_transform(df_target["species"])

# Cluster the features into three groups with k-means
kmeans = KMeans(n_clusters=3, random_state=1)
y_pred = kmeans.fit_predict(df_features)

# Compare the predicted clusters against the true species labels
score = homogeneity_score(df_target["species"], y_pred)
print("Homogeneity Score: ", score)
Here, we first read the iris dataset using the seaborn library. Then, we split the dataset into features and target. df_features contains all the features of the dataset, and we use it to cluster the data. df_target contains the output labels in the “species” column…