A random forest is an ensemble learning method that can be used for classification or regression. A random forest classifier is used to solve classification problems. When we train a random forest on training data, it builds several decision trees. Then, when input features are provided, the random forest predicts the class chosen by the majority of its trees.
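To make the voting step concrete, here is a minimal sketch in plain Python. The helper name majority_vote and the five example votes are made up for illustration; they are not part of sklearn. Note also that scikit-learn's RandomForestClassifier combines its trees by averaging their predicted class probabilities rather than counting hard votes, so this sketch shows the idea, not the library's exact procedure.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class label predicted by the largest number of trees."""
    counts = Counter(tree_predictions)
    label, _ = counts.most_common(1)[0]
    return label


# Hypothetical predictions from five trees for one patient
# (1 = has diabetes, 0 = does not).
votes = [1, 0, 1, 1, 0]
print(majority_vote(votes))  # prints 1, since three of the five trees voted 1
```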
In our previous articles, we discussed classification trees and regression trees. They are useful algorithms, but a single decision tree tends to overfit the training data. A random forest combines several decision trees to reduce this overfitting. Readers who want to know more about how random forests work can refer to this YouTube video: https://www.youtube.com/watch?v=J4Wdy0Wc_xQ
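One rough way to look for overfitting is to compare accuracy on the training data with accuracy on held-out test data, for a single tree and for a forest. The sketch below uses synthetic data from make_classification as a stand-in (an assumption, not the diabetes dataset), and the parameter values are arbitrary. A large gap between training and test accuracy suggests overfitting; the exact numbers will depend on the data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the diabetes dataset).
# flip_y randomly reassigns some labels, so memorizing the training set hurts.
X, y = make_classification(n_samples=1000, n_features=8, n_informative=4,
                           flip_y=0.1, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
forest = RandomForestClassifier(random_state=1).fit(X_train, y_train)

# score() returns the mean accuracy on the given data.
tree_train_acc = tree.score(X_train, y_train)
tree_test_acc = tree.score(X_test, y_test)
forest_train_acc = forest.score(X_train, y_train)
forest_test_acc = forest.score(X_test, y_test)

print("Tree   train/test accuracy:", tree_train_acc, tree_test_acc)
print("Forest train/test accuracy:", forest_train_acc, forest_test_acc)
```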
How to solve classification problems using a random forest classifier in sklearn?
Let's read the Pima diabetes dataset. The dataset has a total of 9 columns. Of these, 8 columns represent features such as glucose, blood pressure, insulin, and BMI. The Outcome column indicates whether the patient has diabetes.
We will use the following Python code to read the dataset and use a random forest classifier to solve the classification problem.
import pandas
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Read the dataset and inspect it.
df = pandas.read_csv("diabetes.csv")
df.info()  # info() prints its summary itself and returns None
print(df.head())

# Separate the feature columns from the target column.
df_features = df.drop(labels=["Outcome"], axis=1)
df_target = df.filter(items=["Outcome"])
print(df_features.head())
print(df_target.head())

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(df_features, df_target["Outcome"], test_size=0.2, shuffle=True, random_state=1)

# Train the random forest and evaluate it on the test set.
classifier = RandomForestClassifier(random_state=1)
classifier.fit(X_train, y_train)

y_test_pred = classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_test_pred)
print("Accuracy Score: ", accuracy)
Here, we first read the diabetes dataset and split it into features and target. The df_features DataFrame contains the features in its columns, and the df_target DataFrame contains the target variable Outcome in its column…
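The feature/target split can be tried on a tiny DataFrame without the CSV file. The rows below are made up for illustration and are not taken from diabetes.csv; only two feature columns are included to keep the sketch short. It shows that drop() and filter() both return DataFrames, while selecting the Outcome column by name gives a one-dimensional Series, which is what the code above passes to train_test_split.

```python
import pandas

# Made-up rows for illustration only; not taken from diabetes.csv.
df = pandas.DataFrame({
    "Glucose": [100, 120, 140],
    "BMI": [22.0, 30.5, 27.1],
    "Outcome": [0, 1, 1],
})

# drop() returns a new DataFrame without the target column.
df_features = df.drop(labels=["Outcome"], axis=1)

# filter(items=[...]) keeps only the named columns and still returns a DataFrame.
df_target = df.filter(items=["Outcome"])

# Selecting a single column by name gives a one-dimensional Series.
y = df_target["Outcome"]

print(list(df_features.columns))
print(type(df_target).__name__, df_target.shape)
print(type(y).__name__, y.shape)
```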





