The contingency matrix is a simple yet powerful tool for measuring clustering performance in machine learning. A contingency matrix has one row per true class and one column per cluster, so it is square only when the number of classes equals the number of clusters. The entry in the ith row and jth column is the number of samples that have true label i and were assigned to cluster j. For a perfect clustering, each row and each column contains exactly one non-zero entry; since cluster numbering is arbitrary, the matrix of a perfect clustering is diagonal up to a reordering of the columns. A count anywhere else indicates a clustering error.
To summarize, the contingency matrix describes the relationship between the true labels and the predicted cluster assignments.
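As a minimal illustration, consider the following sketch, which calls sklearn's contingency_matrix on six hand-made labels (the label values here are made up purely for illustration):

from sklearn.metrics.cluster import contingency_matrix

# True labels and hypothetical cluster assignments for 6 samples
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

# Entry (i, j) counts samples with true label i assigned to cluster j
print(contingency_matrix(y_true, y_pred))
# [[2 0 0]
#  [0 1 1]
#  [0 0 2]]

The single off-diagonal count in the middle row flags the one sample with true label 1 that was wrongly placed in cluster 2.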
How to measure clustering performance using a contingency matrix in sklearn?
In this article, we will work through a clustering example. We will load the iris dataset, cluster it with k-means, and measure the clustering performance by printing the contingency matrix. We can use the following Python code for that purpose:
from sklearn.cluster import KMeans
from sklearn.metrics.cluster import contingency_matrix
from sklearn.preprocessing import LabelEncoder
import seaborn

# Load the iris dataset and separate the features from the true labels
df = seaborn.load_dataset("iris")
df_features = df.drop(labels=["species"], axis=1)
df_target = df.filter(items=["species"])
print(df.head())

# Encode the string species names as integer labels 0, 1, 2
encoder = LabelEncoder()
df_target["species"] = encoder.fit_transform(df_target["species"])

# Cluster the features into 3 clusters with k-means
kmeans = KMeans(n_clusters=3, random_state=1)
y_pred = kmeans.fit_predict(df_features)

# Compare the true labels with the cluster assignments
matrix = contingency_matrix(df_target["species"], y_pred)
print("Contingency Matrix: \n", matrix)
Here, we first read the iris dataset using the seaborn library. Then, we separate the features from the labels: df_features contains the features, and df_target contains the true labels. Next, we use LabelEncoder to convert the string species names into integer labels. We then fit a k-means model with three clusters on the features to obtain the predicted cluster assignments, and finally we print the contingency matrix that compares the true labels with the predicted clusters.
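Beyond inspecting the matrix visually, we may want a single summary number. One simple score that can be derived from a contingency matrix is purity: the fraction of samples that fall into the dominant true label of their cluster. Purity is not computed by the code above; the following sketch derives it by hand from the matrix variable produced there:

import numpy as np

# Purity: for each cluster (column), take the count of its dominant
# true label, sum these counts, and divide by the total number of
# samples. A purity of 1.0 corresponds to a perfect clustering.
purity = np.sum(np.max(matrix, axis=0)) / np.sum(matrix)
print("Purity:", purity)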