Let’s say we have a categorical column that contains strings. For example, suppose we are reading the CSV file “iris.csv”, which has five columns. The first four columns give the sepal length, sepal width, petal length, and petal width of each flower, and the last column gives its species. We are building a machine learning model that predicts the species of a flower from these four measurements.
Now, after reading the dataset, we want to know the class distribution of the classification problem. That is, if there are three classes in total, we want to know how many records belong to each class. We can use the following Python code for that purpose:
import pandas
# Read the dataset into a DataFrame
data = pandas.read_csv("iris.csv")
# Count the rows that belong to each species
print(data.groupby("species").size())
Here, we first import the pandas module and read the CSV file “iris.csv” into a DataFrame. The expression data.groupby("species").size() groups the rows by the value in the species column and returns the number of rows in each group, which is the class distribution of the classification problem. The print() call then displays those counts.
The output of the above program will be:
species
setosa        50
versicolor    50
virginica     50
dtype: int64
From the output, we see that there are three classes: setosa, versicolor, and virginica. Each class appears 50 times, so the dataset is perfectly balanced.
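Raw counts are easy to read when the classes are this evenly spread, but relative frequencies are often more convenient for comparing datasets of different sizes. The sketch below is one possible way to get them, using value_counts(normalize=True). To stay self-contained it does not read “iris.csv”. Instead it builds a small stand-in DataFrame with the same 50/50/50 split as the counts above. The names counts, proportions, and balance_ratio are illustrative and not part of the original program.

```python
import pandas

# Hypothetical stand-in for the "species" column of iris.csv:
# three classes with 50 rows each, mirroring the counts shown above.
data = pandas.DataFrame({
    "species": ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50
})

# Absolute number of rows per class
# (the same numbers that groupby("species").size() reports).
counts = data["species"].value_counts()

# Share of all rows that belongs to each class (values sum to 1).
proportions = data["species"].value_counts(normalize=True)

# Size of the rarest class divided by the size of the most common class.
# A value of 1.0 means every class has the same number of rows.
balance_ratio = counts.min() / counts.max()

print(counts)
print(proportions)
print(balance_ratio)
```

With the real file, replacing the stand-in DataFrame with pandas.read_csv("iris.csv") should give the same kind of result, assuming the column is named species as in the program above.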







































