What is feature selection in machine learning and why do we need that?
A dataset may contain many columns, where each column corresponds to a feature or attribute. If we run a machine learning model on all of those features, we may not get optimal performance. Training the model on a well-chosen subset of features can also reduce training time. So, it is crucial to select the right features and train the machine learning model only on those.
There are various methods for feature selection. In this article, we will discuss feature selection based on variance.
What is feature selection based on variance in machine learning?
If a feature or attribute has low variance, its values are nearly the same across the data points, so it carries little information for distinguishing one data point from another. In that case, it is usually safe to remove such features. Note that variance depends on scale, so this comparison is most meaningful when the features are on comparable scales.
So, we can find out the variance of each numerical feature in a dataset and remove the features whose variances are below a certain threshold.
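As a minimal sketch of this idea (using a small made-up DataFrame, not a real dataset), we can compute the per-column variance with pandas and keep only the columns whose variance exceeds a chosen threshold:

```python
import pandas as pd

# Toy dataset: "constant_ish" barely varies, so its variance is near zero,
# while "varied" spreads widely across the data points.
df = pd.DataFrame({
    "constant_ish": [1.0, 1.0, 1.0, 1.01],
    "varied": [1.0, 5.0, 9.0, 13.0],
})

threshold = 0.1
variances = df.var()                          # per-column sample variance
keep = variances[variances > threshold].index # columns above the threshold
selected = df[keep]
print(list(selected.columns))                 # only "varied" survives
```

The threshold value 0.1 here is arbitrary; in practice it should be chosen based on the scale of the features and the problem at hand.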
How to select features based on variance in machine learning?
We can read a dataset, filter the numerical columns, and calculate the variance of each feature. After that, we can drop the features whose variance falls below a chosen threshold.
For example, we can read the “penguins” dataset, filter the columns with numerical features, and then print the variance of those features. We can use the following Python code for that purpose:

```python
import seaborn

# Load the penguins dataset bundled with seaborn
df = seaborn.load_dataset("penguins")
print(df.info())

# Drop the categorical columns, keeping only the numerical features
features = df.drop(["species", "island", "sex"], axis=1)

# Print the variance of each numerical feature
print(features.var())
```
The output of the above program will be: …
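Once we know the variances, scikit-learn's VarianceThreshold can automate the dropping step. The sketch below applies it to a small synthetic frame (made-up values standing in for the numerical penguin features, so it runs without downloading anything); with the default-style threshold of 0.0, features with zero variance are removed:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Synthetic numeric frame; the values are illustrative, not real penguin data
X = pd.DataFrame({
    "bill_length_mm": [39.1, 39.5, 40.3, 36.7],  # varies across rows
    "near_constant":  [1.0, 1.0, 1.0, 1.0],      # zero variance
})

selector = VarianceThreshold(threshold=0.0)  # remove zero-variance features
X_reduced = selector.fit_transform(X)

# get_support() returns a boolean mask of the columns that were kept
kept = X.columns[selector.get_support()]
print(list(kept))  # ['bill_length_mm']
```

Note that VarianceThreshold looks only at the features, not the target, so it works for unsupervised settings as well.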