What is the V-measure in clustering?
The V-measure is a metric for evaluating clustering performance in machine learning. In our previous articles, we discussed the completeness score (What is completeness score?) and the homogeneity score (What is homogeneity score?). We learned that a clustering result is complete if all the data points that are members of a given class belong to the same cluster, and homogeneous if each cluster contains only data points that are members of a single class. Both the completeness score and the homogeneity score are numbers between 0 and 1. The V-measure is the harmonic mean of the homogeneity score and the completeness score, that is, 2 × homogeneity × completeness / (homogeneity + completeness).
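To make the harmonic mean concrete, here is a minimal from-scratch sketch. It is not sklearn's implementation. It assumes the usual entropy-based definitions of homogeneity and completeness (one minus a normalized conditional entropy), which this article does not spell out. The helper names entropy, conditional_entropy, and v_measure are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a sequence of labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def conditional_entropy(labels, given):
    """H(labels | given): remaining uncertainty in `labels` once `given` is known."""
    n = len(labels)
    joint = Counter(zip(given, labels))
    marginal = Counter(given)
    return -sum((c / n) * log(c / marginal[g]) for (g, _), c in joint.items())


def v_measure(labels_true, labels_pred):
    """Return (homogeneity, completeness, v_measure) for two labelings."""
    h_true = entropy(labels_true)
    h_pred = entropy(labels_pred)
    # Assumed definitions: 1 - H(class | cluster) / H(class), and the mirror image.
    homogeneity = 1.0 if h_true == 0 else 1.0 - conditional_entropy(labels_true, labels_pred) / h_true
    completeness = 1.0 if h_pred == 0 else 1.0 - conditional_entropy(labels_pred, labels_true) / h_pred
    if homogeneity + completeness == 0:
        return homogeneity, completeness, 0.0
    # Harmonic mean of the two scores.
    v = 2 * homogeneity * completeness / (homogeneity + completeness)
    return homogeneity, completeness, v


# Two classes; the second class is split across two clusters.
labels_true = [0, 0, 1, 1]
labels_pred = [0, 0, 1, 2]
h, c, v = v_measure(labels_true, labels_pred)
print("homogeneity:", h, "completeness:", c, "V-measure:", v)
```

In this toy example every cluster is pure, so homogeneity is 1, but the second class is split across two clusters, so completeness drops to about 0.67. The harmonic mean then gives a V-measure of about 0.8.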
The V-measure score is also a number between 0 and 1. The closer the score is to 1, the better the clustering matches the true classes, and a score of 1 indicates a perfect labeling. Note that the V-measure score is independent of the absolute values of the labels: if we permute the label values, the score remains the same. The metric is also symmetric: if we switch the true labels and the predicted labels, the V-measure score remains the same.
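Both properties can be checked with a short sketch using sklearn's v_measure_score. The label lists and the renamed mapping below are made-up toy values used only for illustration.

```python
from math import isclose

from sklearn.metrics import v_measure_score

# Toy labelings (illustrative values, not taken from a real dataset).
labels_true = [0, 0, 1, 1, 2, 2]
labels_pred = [0, 0, 1, 2, 2, 2]

base = v_measure_score(labels_true, labels_pred)

# Permutation invariance: rename the cluster ids (0 -> 2, 1 -> 0, 2 -> 1).
renamed = {0: 2, 1: 0, 2: 1}
permuted_pred = [renamed[k] for k in labels_pred]
permuted = v_measure_score(labels_true, permuted_pred)

# Symmetry: swap the roles of the true and predicted labels.
swapped = v_measure_score(labels_pred, labels_true)

print("base:    ", base)
print("permuted:", permuted)
print("swapped: ", swapped)
print("all equal:", isclose(base, permuted) and isclose(base, swapped))
```

The symmetry holds for the default setting of v_measure_score, where homogeneity and completeness are weighted equally.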
How to measure clustering performance using V-measure in sklearn?
In this article, we will look into an example. We will read the iris dataset and perform k-means clustering. After that, we will compare the true labels with the predicted labels and calculate the V-measure score using sklearn.
We can use the following Python code for this purpose:
from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score
import seaborn
from sklearn.preprocessing import LabelEncoder

# Load the iris dataset and separate the features from the labels
df = seaborn.load_dataset("iris")
df_features = df.drop(labels=["species"], axis=1)
df_target = df.filter(items=["species"])
print(df.head())

# Encode the species names as integers
encoder = LabelEncoder()
df_target["species"] = encoder.fit_transform(df_target["species"])

# Cluster the features into three groups with k-means
kmeans = KMeans(n_clusters=3, random_state=1)
y_pred = kmeans.fit_predict(df_features)

# Compare the true labels with the predicted cluster labels
score = v_measure_score(df_target["species"], y_pred)
print("V-measure Score: ", score)
Here, we first read the iris dataset and separate the features from the labels. df_features contains the features from the dataset, and df_target contains the labels…