In our previous article, we discussed feature selection with recursive feature elimination using sklearn. We can also select features based on a fitted model: for example, we can fit a linear regression, inspect its coefficients and the resulting model performance, and keep the features that contribute most to the model.
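As a minimal sketch of that idea (not part of this article's code, and using a synthetic dataset from make_regression purely for illustration), sklearn's SelectFromModel can keep the features a fitted linear regression weights most heavily:

from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LinearRegression

# Synthetic regression data: 10 features, only 3 of which are informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=3, random_state=0)

# SelectFromModel ranks features by the absolute value of the fitted coefficients
# and keeps those above the threshold (here, the mean coefficient magnitude)
selector = SelectFromModel(LinearRegression(), threshold="mean")
selector.fit(X, y)

print(selector.get_support())       # boolean mask of the selected features
print(selector.transform(X).shape)  # feature matrix reduced to the selected columns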
In this article, we take a random forest regression as the example. We run the model on the California housing dataset and select features from the dataset based on the fitted model.
Let’s read the California housing dataset and create two DataFrames. One DataFrame contains the features and the other contains only the labels.
from sklearn.feature_selection import SelectFromModel
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor

# Load the California housing dataset as a pandas DataFrame
data = fetch_california_housing(as_frame=True)
df = data.frame
print(df.info())
print(df.head())

# Split the DataFrame into labels (the target column) and features
df_labels = df[["MedHouseVal"]]
df_features = df.drop(["MedHouseVal"], axis=1)
print(df_features.head())
print(df_labels.head())
The fetch_california_housing(as_frame=True) call fetches the California housing dataset using the sklearn library. The as_frame=True parameter tells sklearn to return the data as pandas objects, so the returned object carries the full dataset as a DataFrame in its frame attribute.
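As a quick, optional check (not part of the article's walkthrough), the returned object is a Bunch whose attributes expose the same data in several forms; the attribute names below are the ones sklearn documents for this loader:

from sklearn.datasets import fetch_california_housing

data = fetch_california_housing(as_frame=True)
print(type(data.frame))     # pandas DataFrame with the features and the target combined
print(data.feature_names)   # names of the eight feature columns
print(data.target_names)    # ['MedHouseVal']
print(data.data.shape)      # features only: (20640, 8)
print(data.target.shape)    # target only: (20640,)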
The statement df = data.frame then extracts that DataFrame from the returned data.