19.865320% missing values (How to calculate the percentage of missing values in a column in a dataset?). So, let’s create a DataFrame that contains only the age and survived columns.
import pandas

df1 = pandas.read_csv("titanic.csv")
df2 = df1[["age", "survived"]]
print(df2.head())
The output will be:
    age  survived
0  22.0         0
1  38.0         1
2  26.0         1
3  35.0         1
4  35.0         0
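Before imputing anything, it can help to confirm how much of a column is actually missing. The sketch below shows one way to compute that percentage with pandas. The `toy` DataFrame and its values are a hypothetical stand-in for illustration, not the real contents of titanic.csv.

```python
import pandas as pd

# Hypothetical stand-in for the age column (not the real Titanic data).
toy = pd.DataFrame({"age": [22.0, None, 26.0, 35.0, None]})

# isna() marks each missing entry as True; the mean of those booleans
# is the fraction of missing values, so multiplying by 100 gives a percentage.
missing_pct = toy["age"].isna().mean() * 100

print(f"{missing_pct:.2f}% missing")
```

Applying the same expression to `df2["age"]` should give the share of missing ages in the real dataset.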
As we saw earlier, the age column has 19.865320% missing values. When a numerical column contains missing values, they can be handled using mean imputation or median imputation.
In the case of mean imputation, we fill the missing values with the mean value of the column. And, in the case of median imputation, we fill the missing values with the median value of the column.
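As a minimal sketch of the two approaches described above, both can be written with the pandas `fillna` method. The `toy` DataFrame and its values are made up for illustration and do not come from titanic.csv.

```python
import pandas as pd

# Hypothetical column with two missing entries (not the real Titanic data).
toy = pd.DataFrame({"age": [20.0, None, 30.0, 70.0, None]})

# mean() and median() skip missing values by default.
mean_age = toy["age"].mean()      # (20 + 30 + 70) / 3 = 40.0
median_age = toy["age"].median()  # middle of 20, 30, 70 = 30.0

# fillna() returns a new Series; the original column is left unchanged.
mean_filled = toy["age"].fillna(mean_age)
median_filled = toy["age"].fillna(median_age)

print(mean_filled.tolist())    # [20.0, 40.0, 30.0, 70.0, 40.0]
print(median_filled.tolist())  # [20.0, 30.0, 30.0, 70.0, 30.0]
```

Note how the single large value (70) pulls the mean up to 40, while the median stays at 30.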
In machine learning, the choice between mean imputation and median imputation depends on the distribution of the data in the column. Usually, if the distribution is approximately normal (symmetric), we use mean imputation. If the distribution is skewed, we use median imputation, because the median is less affected by extreme values than the mean.
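One rough way to turn this rule of thumb into code is to measure the sample skewness with the pandas `Series.skew` method and pick the fill value accordingly. This is only a sketch: the helper name `choose_fill_value` and the cutoff of 1.0 are assumptions made for illustration, not a standard rule, and a plot of the distribution is still worth checking.

```python
import pandas as pd

def choose_fill_value(column: pd.Series, threshold: float = 1.0) -> float:
    """Hypothetical helper: return the mean for roughly symmetric data
    and the median when the absolute skewness exceeds the threshold.
    The threshold of 1.0 is an assumed cutoff, not a fixed convention."""
    if abs(column.skew()) > threshold:
        return float(column.median())
    return float(column.mean())

# Made-up examples: one symmetric column, one with a large outlier.
symmetric = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
skewed = pd.Series([1.0, 1.0, 2.0, 2.0, 100.0])

print(choose_fill_value(symmetric))  # mean is used
print(choose_fill_value(skewed))     # median is used
```

In the skewed example, the outlier drags the mean to 21.2, which is far from most of the values, so the median of 2.0 is the more representative fill value.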
So, what should we do in our example? Let’s try to find out. Let’s first print the mean and median values of the age column.
import pandas

df1 = pandas.read_csv("titanic.csv")
df2 = df1[["age", "survived"]]
print("Mean Age: ", df2["age"].mean())
print("Median Age: ", df2["age"].median())
The output is:
Mean Age:  29.69911764705882
Median Age:  28.0
Now, we should look into the distribution of the data. To do so, we will plot the values of the age column using a kernel …





