Let’s say we are reading the Titanic dataset. The age column has some missing values, and we want to fill them in. Before we do so, we want to know how the data in the age column is distributed: is it roughly normal or skewed? If it is roughly normal, we can fill in the missing values with the mean age. If it is skewed, the median age is the safer choice, because the mean gets pulled toward the long tail. A histogram lets us see the distribution of the data in a column.
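The mean-or-median rule can also be written as a small helper. This is only a sketch: the function name fill_missing and the 0.5 skewness cutoff are assumptions made for illustration (a common rule of thumb, not something the dataset dictates), and the sample values are made up.

```python
import pandas


def fill_missing(series, skew_threshold=0.5):
    """Fill NaN values with the mean if the data looks roughly symmetric,
    otherwise with the median.

    skew_threshold is an assumed rule-of-thumb cutoff, not a fixed standard.
    Returns the filled series and the name of the strategy that was used.
    """
    skewness = series.skew()  # sample skewness; NaN values are ignored
    if abs(skewness) <= skew_threshold:
        strategy, value = "mean", series.mean()
    else:
        strategy, value = "median", series.median()
    return series.fillna(value), strategy


# Made-up sample with one large outlier and one missing value.
sample = pandas.Series([1, 2, 3, 4, 100, None])
filled, strategy = fill_missing(sample)
print(strategy)
print(filled.tolist())
```

With the real data, the same helper could be applied as df["age"], strategy = fill_missing(df["age"]). Looking at the histogram first is still worthwhile, since a single skewness number can hide things like two separate peaks.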
So, let’s try to plot a histogram of the data in the age column of the dataset.
import pandas
from matplotlib import pyplot

df = pandas.read_csv("titanic.csv")
print(df.info())
df["age"].hist()
pyplot.savefig("titanic-age.png")
The output of df.info() shows the following:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   survived     891 non-null    int64
 1   pclass       891 non-null    int64
 2   sex          891 non-null    object
 3   age          714 non-null    float64
 4   sibsp        891 non-null    int64
 5   parch        891 non-null    int64
 6   fare         891 non-null    float64
 7   embarked     889 non-null    object
 8   class        891 non-null    object
 9   who          891 non-null    object
 10  adult_male   891 non-null    bool
 11  deck         203 non-null    object
 12  embark_town  889 non-null    object
 13  alive        891 non-null    object
 14  alone        891 non-null    bool
dtypes: bool(2), float64(2), int64(4), object(7)
memory usage: 92.4+ KB
None
As we can see, the age column has 714 non-null values out of 891 rows, so 177 ages (about 20%) are missing.
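The missing count can also be computed directly instead of being read off the df.info() output. The snippet below is a minimal sketch: the helper name missing_summary is hypothetical, and the small series is a made-up stand-in for the age column.

```python
import pandas


def missing_summary(series):
    """Return (missing count, total rows, share missing) for a column."""
    missing = int(series.isna().sum())  # number of NaN entries
    total = len(series)                 # number of rows, NaN included
    return missing, total, missing / total


# Made-up stand-in for the age column.
age = pandas.Series([22.0, None, 26.0, 35.0, None], name="age")
missing, total, share = missing_summary(age)
print(f"{missing} of {total} values missing ({share:.1%})")
```

Calling missing_summary(df["age"]) on the real DataFrame should agree with the non-null count reported by df.info().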
The histogram of the age column looks like the following:





