DataFrame contains the features total_bill and size, and the df_target DataFrame contains the tip amount in its single column, tip.
df_features = df.filter(items=["total_bill", "size"])
df_target = df.filter(items=["tip"])
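To see what filter(items=...) returns, here is a minimal sketch. It assumes a tiny stand-in DataFrame built from the first three rows shown in the output, not the full dataset, and the name same_features is only illustrative.

```python
import pandas as pd

# Stand-in for the tips data: only the first three rows, for illustration.
df = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01],
    "tip": [1.01, 1.66, 3.50],
    "size": [2, 3, 3],
})

# filter(items=...) keeps only the named columns and returns a new DataFrame.
df_features = df.filter(items=["total_bill", "size"])
df_target = df.filter(items=["tip"])

# Selecting with a list of labels gives the same result when the columns exist.
same_features = df[["total_bill", "size"]]
print(df_features.equals(same_features))

# df_target is a one-column DataFrame; df_target["tip"] is the Series inside it.
print(type(df_target).__name__, type(df_target["tip"]).__name__)
```

One difference worth knowing: filter() silently skips labels that are not present, while df[[...]] raises a KeyError for a missing column.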
Next, we split the dataset into a training set and a test set. The test set holds 20% of the rows (test_size=0.2), and shuffle=True shuffles the rows before the split. The random_state=1 parameter seeds the random number generator used for shuffling, so the same split is produced on every run.
X_train, X_test, y_train, y_test = train_test_split(df_features, df_target["tip"], test_size=0.2, shuffle=True, random_state=1)
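The effect of test_size and random_state can be checked with a short sketch. It assumes synthetic NumPy arrays with 244 rows (the same row count as the dataset) in place of the real features and target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the dataset (244).
X = np.arange(244 * 2).reshape(244, 2)
y = np.arange(244)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=1
)
# The test size is rounded up, so 20% of 244 rows gives 49 test rows.
print(len(X_train), len(X_test))  # 195 49

# Repeating the call with the same random_state reproduces the split exactly.
_, X_test_again, _, _ = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=1
)
print(np.array_equal(X_test, X_test_again))  # True
```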
Next, we initialize the random forest regressor. The random_state=1 parameter controls the randomness of the bootstrap sampling of the rows used to build each tree and of the sampling of features considered at each split, which makes the results reproducible.
The fit() method trains the regressor on the training set, and the predict() method then predicts the target variable for the test set.
regressor = RandomForestRegressor(random_state=1)
regressor.fit(X_train, y_train)
y_test_pred = regressor.predict(X_test)
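The role of random_state in the regressor can be illustrated with a small experiment. This is a sketch on synthetic data from make_regression, not the tips dataset. The helper fit_predict and the reduced n_estimators=20 are assumptions made to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data with two features (illustrative only).
X, y = make_regression(n_samples=200, n_features=2, noise=10.0, random_state=0)

def fit_predict(seed):
    """Train a small forest with the given seed and predict the held-out rows."""
    model = RandomForestRegressor(n_estimators=20, random_state=seed)
    model.fit(X[:150], y[:150])
    return model.predict(X[150:])

pred_a = fit_predict(1)
pred_b = fit_predict(1)
pred_c = fit_predict(2)

# The same seed draws the same bootstrap samples, so the predictions match.
print(np.array_equal(pred_a, pred_b))
# A different seed usually builds different trees and changes the predictions.
print(np.array_equal(pred_a, pred_c))
```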
Now we can compare y_test_pred with y_test to measure the performance of the model. Here, we calculate the R-squared score and the Root Mean Square Error (RMSE).
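r2 = r2_score(y_test, y_test_pred)
# RMSE is the square root of the MSE. The squared=False argument is deprecated since scikit-learn 1.4, so the root is taken explicitly.
rmse = mean_squared_error(y_test, y_test_pred) ** 0.5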
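To make the two metrics concrete, here is a minimal pure-Python sketch of the standard formulas, R-squared = 1 - SS_res / SS_tot and RMSE = sqrt(SS_res / n). The function name r2_and_rmse and the four sample values are hypothetical and are not taken from the dataset.

```python
import math

def r2_and_rmse(y_true, y_pred):
    """Return (R-squared, RMSE) for two equal-length sequences of numbers."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual sum of squares: squared errors of the predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: squared deviations of the targets from their mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / n)
    return r2, rmse

# Hypothetical values chosen so the arithmetic is easy to follow:
# ss_res = 0.5 and ss_tot = 5.0, so R-squared is 1 - 0.5 / 5.0.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
r2, rmse = r2_and_rmse(y_true, y_pred)
print(r2, rmse)
```

R-squared is unitless, with 1.0 meaning a perfect fit. RMSE is in the same units as the target, so for this model it is measured in the units of tip.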
The output of the above program will be:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   total_bill  244 non-null    float64
 1   tip         244 non-null    float64
 2   sex         244 non-null    category
 3   smoker      244 non-null    category
 4   day         244 non-null    category
 5   time        244 non-null    category
 6   size        244 non-null    int64
dtypes: category(4), float64(2), int64(1)
memory usage: 7.4 KB
None
   total_bill  size
0       16.99     2
1       10.34     3
2       21.01     3
3       23.68     2
4       24.59     4
    tip
0  1.01
1  1.66
2  3.50
3  3.31
4  3.61
R2 Score: 0.4971554940373689
RMSE: 1.150371809730061