What are outliers?
Let’s say our target audience is people aged 20 to 30 years, both inclusive. We collected responses from various people, along with their ages, and saved them. Later, when we started analyzing the data, we saw that some of the respondents were below 20 years old and some were above 30 years old. These respondents are not our target audience, so we need to remove their responses from our dataset. If we do not, the data may lead to misleading inferences, which we do not want. The respondents who are not our target audience, yet whose data are present in the dataset, are called the outliers of the dataset.
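As a minimal sketch of the age example above, assuming the responses are held in a pandas DataFrame: the column names and the ages below are made up purely for illustration.

```python
import pandas as pd

# Hypothetical survey responses; the names and ages are invented for illustration.
responses = pd.DataFrame({
    "respondent": ["A", "B", "C", "D", "E"],
    "age": [18, 20, 25, 30, 34],
})

# True for respondents aged 20 to 30, both inclusive.
in_target = responses["age"].between(20, 30)

target = responses[in_target]     # rows we keep for analysis
outliers = responses[~in_target]  # rows outside the target age group

print(target)
print(outliers)
```

Keeping the removed rows in a separate variable, rather than dropping them silently, makes it easy to check how many responses were excluded.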
How to detect outliers in a dataset?
We can use a box-and-whisker plot to see if outliers are present in the dataset. For example, let’s read the Titanic dataset and draw a box plot of its age column.
import seaborn
from matplotlib import pyplot

df = seaborn.load_dataset("titanic")
seaborn.boxplot(data=df, x="age")
pyplot.savefig("titanic-age-outliers.png")
pyplot.close()
The output plot will look like the following:
Here, we can see some dots on the right-hand side of the plot. They are possible outliers.
We can also detect outliers in a dataset using the interquartile range (IQR). We know that quartiles divide data into four equal …
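Since the explanation above is cut off, here is only a rough sketch of how an IQR check might look. It assumes the common 1.5 × IQR rule, which is also the default whisker length in a seaborn box plot. The ages are made up for illustration and are not taken from the Titanic dataset.

```python
import pandas as pd

# Hypothetical ages, invented for illustration.
ages = pd.Series([22, 24, 25, 26, 27, 28, 29, 30, 31, 70])

# First and third quartiles, and the interquartile range between them.
q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1

# Assumed rule: values more than 1.5 * IQR outside the quartiles are flagged.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = ages[(ages < lower) | (ages > upper)]
print(outliers.tolist())
```

The multiplier 1.5 is a convention rather than a fixed law, so it can be adjusted if the rule flags too many or too few values for a given dataset.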





