If two or more features in a dataset are strongly correlated, they carry largely redundant information, so we should keep only one of them. Selecting the right features for a machine learning model can improve the model's performance and also reduce the training time of the algorithm.
Whether two features are strongly correlated is determined by their correlation coefficient. The correlation coefficient between two random variables X and Y is a number between -1 and +1. Its sign tells us whether the variables are positively or negatively correlated. If the coefficient is positive, X and Y are positively correlated: as X increases, Y tends to increase as well. If the coefficient is negative, the variables are negatively correlated: as X increases, Y tends to decrease.
The closer the magnitude of the correlation coefficient is to 1, the more strongly the variables are correlated. A magnitude close to 0 means there is little or no linear relationship between them.
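To make the sign and magnitude concrete, here is a minimal sketch that computes the coefficient by hand with NumPy. It assumes the Pearson correlation coefficient, which is the default used by pandas' corr(). The helper name pearson_r and the sample values are made up for illustration.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Center each variable on its mean.
    dx = x - x.mean()
    dy = y - y.mean()
    # Sum of co-deviations divided by the product of the two spreads.
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Illustrative values only (not taken from any real dataset).
x = [1, 2, 3, 4, 5]
r_pos = pearson_r(x, [2, 4, 6, 8, 10])  # y rises with x: coefficient near +1
r_neg = pearson_r(x, [10, 8, 6, 4, 2])  # y falls as x rises: coefficient near -1
print(r_pos, r_neg)
```

Both example pairs lie exactly on a straight line, so the magnitudes come out at 1; noisier data would give values closer to 0.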
A dataset often has more than two features. In that case, we can compute the correlation matrix of the features. Each entry in the correlation matrix is the correlation coefficient of the two features that correspond to the entry's row and column.
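As a small illustration of how such a matrix is laid out, the sketch below builds one for a toy pandas DataFrame. The column names a, b, c and their values are hypothetical, chosen so that b is an exact multiple of a.

```python
import pandas as pd

# Hypothetical toy data: "b" is exactly 2 * "a", "c" mostly moves the other way.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 3, 4, 1, 2],
})

# One row and one column per feature.
toy_corr = toy.corr()
print(toy_corr.round(2))

# Look up a single coefficient by row and column label.
print(toy_corr.loc["a", "b"])
```

The matrix is symmetric, since the correlation of a with b equals that of b with a, and its diagonal is 1 because every feature is perfectly correlated with itself.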
In Python, we can use the corr() method of a pandas DataFrame to compute the correlation matrix of the features. By default, it computes the Pearson correlation coefficient.
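import seaborn
from matplotlib import pyplot

# Load the penguins example dataset as a pandas DataFrame
df = seaborn.load_dataset("penguins")
df.info()  # info() prints its summary itself and returns None

# Drop the non-numeric columns before computing correlations
features = df.drop(["species", "island", "sex"], axis=1)

# Pairwise correlation coefficients of the remaining features
corr_matrix = features.corr()

# Plot the matrix as a heatmap, writing each coefficient in its cell
seaborn.heatmap(data=corr_matrix, annot=True)
pyplot.show()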
We can also plot the correlation matrix as a heatmap using the seaborn Python library, as the heatmap() call in the code above does. The resulting heatmap looks like the following: …
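Once the correlation matrix is available, keeping only one feature from each strongly correlated pair could be sketched as follows. This is one possible approach, not a fixed recipe: the helper drop_correlated, the 0.9 threshold, and the toy data are all assumptions made for illustration, and the right threshold depends on the dataset.

```python
import numpy as np
import pandas as pd

def drop_correlated(features, threshold=0.9):
    """Drop the later feature of every pair whose |correlation| exceeds threshold."""
    corr = features.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once
    # and the diagonal (always 1) is ignored.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    # A column is dropped if it is strongly correlated with an earlier column.
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop), to_drop

# Hypothetical toy data: "b" duplicates the information in "a".
data = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 3, 4, 1, 2],
})
reduced, dropped = drop_correlated(data, threshold=0.9)
print(dropped)
print(list(reduced.columns))
```

Taking the absolute value first means strongly negative correlations are treated the same as strongly positive ones, in line with the point that it is the magnitude that indicates strength.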