target. The features DataFrame contains the total bill amount and the party size as columns, while the target DataFrame contains the tip amount as its only column.
df_features = df.filter(items=["total_bill", "size"])
df_target = df.filter(items=["tip"])
After that, we split the dataset into training and test sets. The test set is 20% of the dataset. The shuffle=True parameter indicates that the dataset is shuffled before the split, and random_state=1 seeds the random number generator that controls the shuffling, so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(df_features, df_target["tip"], test_size=0.2, shuffle=True, random_state=1)
Now we initialize the regressor. The random_state=1 parameter in the DecisionTreeRegressor() constructor controls the randomness of the estimator, again for reproducibility.
The fit() method learns from the training data, and the predict() method predicts the target variable for the test set.
regressor = DecisionTreeRegressor(random_state=1)
regressor.fit(X_train, y_train)
y_test_pred = regressor.predict(X_test)
Now we can compare y_test_pred with y_test to measure the performance of the model. Here we calculate the R-squared score and the Root Mean Square Error (RMSE).
r2 = r2_score(y_test, y_test_pred)
rmse = mean_squared_error(y_test, y_test_pred, squared=False)
The output of the above program will be like the following:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 total_bill 244 non-null float64
1 tip 244 non-null float64
2 sex 244 non-null category
3 smoker 244 non-null category
4 day 244 non-null category
5 time 244 non-null category
6 size 244 non-null int64
dtypes: category(4), float64(2), int64(1)
memory usage: 7.4 KB
None
total_bill size
0 16.99 2
1 10.34 3
2 21.01 3
3 23.68 2
4 24.59 4
tip
0 1.01
1 1.66
2 3.50
3 3.31
4 3.61
R2 Score: 0.13122381189707366
RMSE: 1.5120819543710895
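Note that the RMSE reported above is simply the square root of the mean squared error; passing squared=False to mean_squared_error just saves us that step. A tiny sketch with made-up values (not from the tips dataset) to illustrate the relationship:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true and predicted values for illustration only.
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 2.0])

mse = mean_squared_error(y_true, y_pred)  # mean of the squared errors
rmse = np.sqrt(mse)                       # RMSE = square root of MSE
```

Because RMSE is in the same units as the target (dollars of tip, in our case), an RMSE of about 1.51 means the model's tip predictions are off by roughly $1.51 on average.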
As we can see, the performance of the model is not very good. In our next article, we will try to improve this performance using ensemble learning.
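As a preview of that idea, one common ensemble approach is scikit-learn's RandomForestRegressor, which averages many decision trees. The sketch below uses synthetic stand-in data (the real article uses the tips dataset); the shapes and coefficients are assumptions for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the tips data: two features resembling
# bill amount and party size, with a noisy linear relationship.
rng = np.random.RandomState(1)
X = rng.uniform(0, 50, size=(200, 2))
y = 0.15 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(0, 1, 200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# An ensemble of 100 decision trees; averaging their predictions
# typically reduces the variance of a single deep tree.
forest = RandomForestRegressor(n_estimators=100, random_state=1)
forest.fit(X_train, y_train)
pred = forest.predict(X_test)
```

Whether this actually beats the single tree on the tips data is exactly what the next article will examine.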