median_age columns.
import pandas
from matplotlib import pyplot

df1 = pandas.read_csv("titanic.csv")
# Copy the slice so that insert() changes an independent DataFrame
# instead of a view of df1 (avoids SettingWithCopyWarning).
df2 = df1[["age", "survived"]].copy()

print("Mean Age: ", df2["age"].mean())
print("Median Age: ", df2["age"].median())

# Add two imputed versions of the age column.
df2.insert(1, "mean_age", df2["age"].fillna(df2["age"].mean()))
df2.insert(1, "median_age", df2["age"].fillna(df2["age"].median()))

# Draw all three density estimates on the same axes for comparison.
fig = pyplot.figure()
ax = fig.add_subplot(111)
df2["age"].plot.kde(ax=ax, color="blue")
df2["mean_age"].plot.kde(ax=ax, color="green")
df2["median_age"].plot.kde(ax=ax, color="black")

lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc="best")
pyplot.savefig("titanic-all-kde.png")
pyplot.close()
The resulting KDE plot looks like the following:
As we can see, missing numerical values can be filled with either the mean or the median of the column. If the KDE plot looks distorted after mean imputation and the distribution of the data is skewed, we should try median imputation instead, because the median is less sensitive to extreme values than the mean.
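To make the effect of skew concrete, here is a minimal sketch using only the Python standard library. The `ages` list and the `impute` helper are made up for illustration and are not taken from the Titanic dataset. A single large value pulls the mean upward, while the median stays near the bulk of the data.

```python
from statistics import mean, median

# Hypothetical, right-skewed sample with two missing entries (None).
ages = [22, 24, 25, 26, 28, None, 30, 31, None, 80]

# Statistics are computed on the observed values only.
observed = [a for a in ages if a is not None]
mean_age = mean(observed)      # pulled upward by the outlier 80
median_age = median(observed)  # midpoint of the sorted observed values


def impute(values, fill):
    """Return a copy of values with every None replaced by fill."""
    return [fill if v is None else v for v in values]


mean_filled = impute(ages, mean_age)
median_filled = impute(ages, median_age)

print("Mean:", mean_age, "Median:", median_age)
# prints: Mean: 33.25 Median: 27.0
```

In this sample, mean imputation would place the missing ages at 33.25, above all but one of the observed values, whereas median imputation places them at 27, in the middle of the typical range.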





