the machine learning model’s accuracy. One way to handle these missing values is to use the dropna() method to drop rows that contain missing values.
Dropping rows may work well if the dataset is large and only a small fraction of its rows have missing values, because every dropped row also discards the valid values in its other columns. Often, a better approach is to fill the missing values with an appropriate statistic computed from the data, such as the mean or median of the column. Filling the missing values of a dataset in this way is known as data imputation.
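To make the difference between the two approaches concrete, here is a minimal sketch on a small made-up DataFrame (hypothetical toy data, not the Titanic dataset). It uses the pandas fillna() method for the imputation step, which is an assumption made for brevity and is not the approach this article goes on to use:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two of the five ages are missing
toy_df = pd.DataFrame({
    "age": [22.0, np.nan, 30.0, np.nan, 38.0],
    "fare": [7.25, 71.28, 8.05, 53.10, 8.46],
})

# Option 1: drop every row that contains at least one missing value
dropped = toy_df.dropna()

# Option 2: impute, i.e. fill the missing ages with the mean of the known ages
imputed = toy_df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].mean())

print(len(toy_df), len(dropped), len(imputed))
print(imputed["age"].tolist())
```

Dropping keeps only three of the five rows and loses two valid fare values, while imputation keeps all five rows.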
In sklearn, we can use the SimpleImputer class from the sklearn.impute module to perform data imputation. For example, let’s say we want to fill the missing values in the age column of the dataset with the mean age. We can use the following Python code for that purpose:
import numpy
import seaborn
from sklearn.impute import SimpleImputer

df = seaborn.load_dataset("titanic")
imputer = SimpleImputer(missing_values=numpy.nan, strategy="mean")
df[["age"]] = imputer.fit_transform(df[["age"]])
print(df.isnull().mean()*100)
Here, we first load the Titanic dataset using the seaborn Python library. After that, we use the SimpleImputer class for data imputation. The missing_values parameter tells the imputer which values to treat as missing, in this case numpy.nan. The strategy parameter sets the imputation strategy. Here, we are filling all the missing values with the mean value of the column, so the strategy parameter is “mean”. The other strategies are “median”, “most_frequent” and “constant”, which fill the missing values with the median of the column, the most frequent value of the column, or a constant value (supplied through the fill_value parameter), respectively.
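The four strategies can be compared side by side. The following is a minimal sketch on a made-up single-column DataFrame; the toy values, the variable names and the fill value of 0.0 are all assumptions chosen for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy column with one missing value at row index 1
toy = pd.DataFrame({"age": [10.0, np.nan, 10.0, 30.0, 70.0]})

results = {}
for strategy in ["mean", "median", "most_frequent"]:
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    filled = imputer.fit_transform(toy[["age"]])
    # Record the value that replaced the missing entry
    results[strategy] = float(filled[1, 0])

# The "constant" strategy needs the replacement value passed as fill_value
constant_imputer = SimpleImputer(
    missing_values=np.nan, strategy="constant", fill_value=0.0
)
results["constant"] = float(constant_imputer.fit_transform(toy[["age"]])[1, 0])

print(results)
```

With these values, each strategy fills the gap with a different number, which shows that the choice of strategy can change the imputed data noticeably.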
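After that, we call the fit_transform() method, which computes the mean of the column and replaces the missing values with it in a single step. Please note that we pass df[["age"]] with double brackets, so only the age column of the dataset is imputed and the input is a two-dimensional DataFrame, which is what SimpleImputer expects.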
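The fit_transform() call combines two steps that can also be run separately: fit() learns the fill value, and transform() applies it. The sketch below, on made-up data with hypothetical variable names, shows one possible use of the split, namely reusing the mean learned from one DataFrame to fill another:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: rows used to learn the mean, and rows seen later
known_rows = pd.DataFrame({"age": [20.0, np.nan, 40.0]})
new_rows = pd.DataFrame({"age": [np.nan, 50.0]})

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit() only computes the statistic; it does not change the data
imputer.fit(known_rows[["age"]])
learned_mean = float(imputer.statistics_[0])

# transform() fills missing values using the statistic learned by fit()
known_filled = imputer.transform(known_rows[["age"]])
new_filled = imputer.transform(new_rows[["age"]])

print(learned_mean)
print(new_filled.ravel().tolist())
```

The missing value in new_rows is filled with the mean of known_rows, not with the mean of new_rows itself.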
After the data imputation, we are printing the percentage of missing values in each column of the dataset. The output now shows:
survived        0.000000
pclass          0.000000
sex             0.000000
age             0.000000
sibsp           0.000000
parch           0.000000
fare            0.000000
embarked        0.224467
class           0.000000
who             0.000000
adult_male      0.000000
deck           77.216611
embark_town     0.224467
alive           0.000000
alone           0.000000
dtype: float64
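As we can see, all the missing values of the age column of the dataset have been filled, and the percentage of missing values in the age column is now zero. The embarked, deck and embark_town columns still contain missing values because we imputed only the age column.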





