Datasets often contain missing values, that is, entries for which no data was recorded. If they are not handled properly, missing values can distort the patterns in the data, so it is important to deal with them before analysis.
A dataset column can contain numerical or categorical data. When a numerical column has missing values, we can use statistical techniques to fill them in, for example by replacing each missing entry with the column’s mean or median. Filling in missing values this way is called imputation.
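As a quick illustration of the idea, here is a minimal sketch of mean and median imputation with pandas. The `toy` DataFrame and its values are made up for this example (they are not taken from the Titanic data), and the column names `age_mean` and `age_median` are hypothetical.

```python
import pandas

# Hypothetical toy data: five ages, two of them missing (NaN).
toy = pandas.DataFrame({"age": [22.0, float("nan"), 30.0, 40.0, float("nan")]})

# mean() and median() skip NaN values by default,
# so both statistics are computed from the three known ages only.
mean_age = toy["age"].mean()
median_age = toy["age"].median()

# fillna() returns a new Series with each NaN replaced by the given value.
toy["age_mean"] = toy["age"].fillna(mean_age)
toy["age_median"] = toy["age"].fillna(median_age)

print(toy)
```

Keeping the imputed values in separate columns leaves the original column untouched, which makes it easy to compare the two strategies side by side.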
Let’s look at a dataset first. We will read the Titanic dataset and see which columns contain missing values.
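import pandas

# Load the Titanic dataset from a local CSV file
df = pandas.read_csv("titanic.csv")

# info() prints its summary itself and returns None,
# so wrapping it in print() also prints "None"
print(df.info())

# isnull().mean() gives the fraction of missing values per column
print("Percentage of missing values: \n", df.isnull().mean()*100)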
The output looks like this:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   survived     891 non-null    int64
 1   pclass       891 non-null    int64
 2   sex          891 non-null    object
 3   age          714 non-null    float64
 4   sibsp        891 non-null    int64
 5   parch        891 non-null    int64
 6   fare         891 non-null    float64
 7   embarked     889 non-null    object
 8   class        891 non-null    object
 9   who          891 non-null    object
 10  adult_male   891 non-null    bool
 11  deck         203 non-null    object
 12  embark_town  889 non-null    object
 13  alive        891 non-null    object
 14  alone        891 non-null    bool
dtypes: bool(2), float64(2), int64(4), object(7)
memory usage: 92.4+ KB
None
Percentage of missing values: 
 survived        0.000000
pclass          0.000000
sex             0.000000
age            19.865320
sibsp           0.000000
parch           0.000000
fare            0.000000
embarked        0.224467
class           0.000000
who             0.000000
adult_male      0.000000
deck           77.216611
embark_town     0.224467
alive           0.000000
alone           0.000000
dtype: float64
So, the dataset contains numerous columns. Out of all the columns, the age column contains floating point values and it has …





