
Finetuning

By Angela C

October 6, 2021 in software, html

Reading time: 9 minutes.

Fine-tune the models

Fine-tune some of the shortlisted models.

  • Use grid search to find a good combination of hyperparameter values instead of manually fiddling with the values.

Hyper-parameters are parameters that are not directly learnt within estimators. In scikit-learn they are passed as arguments to the constructor of the estimator classes. Typical examples include C, kernel and gamma for the Support Vector Classifier, alpha for Lasso, etc. It is possible and recommended to search the hyper-parameter space for the best cross-validation score. Any parameter provided when constructing an estimator may be optimized in this manner. Specifically, to find the names and current values of all parameters for a given estimator, use estimator.get_params().
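
For example, calling get_params() lists everything that can be tuned. A minimal sketch using the RandomForestRegressor used later in this post:

from sklearn.ensemble import RandomForestRegressor

# list the tunable hyperparameters and their current (default) values
forest_reg = RandomForestRegressor(random_state=42)
for name, value in sorted(forest_reg.get_params().items()):
    print(name, '=', value)
# the output includes entries such as bootstrap, max_features and n_estimators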

For given values, GridSearchCV exhaustively considers all parameter combinations, while RandomizedSearchCV can sample a given number of candidates from a parameter space with a specified distribution.

  • Tell GridSearchCV which hyperparameters to experiment with and what values to try out.
  • GridSearchCV uses cross-validation to evaluate all possible combinations of hyperparameter values.
  • from sklearn.model_selection import GridSearchCV

Search for the best combination of hyperparameters for the RandomForestRegressor model:

If you have no idea what value a hyperparameter should have, a reasonable starting point is to try out successive powers of 10, or smaller steps for a more fine-grained search. The grid search provided by GridSearchCV exhaustively generates candidates from a grid of parameter values specified with the param_grid parameter.
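
For instance, a coarse-to-fine search over the Lasso alpha hyperparameter mentioned above might look like the following sketch (the particular values are illustrative, not taken from the example below):

from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# coarse pass: successive powers of 10
coarse_grid = {'alpha': [0.001, 0.01, 0.1, 1.0, 10.0]}
coarse_search = GridSearchCV(Lasso(), coarse_grid, cv=5,
                             scoring='neg_mean_squared_error')

# finer pass around the best coarse value (say 0.1)
fine_grid = {'alpha': [0.03, 0.1, 0.3]}
fine_search = GridSearchCV(Lasso(), fine_grid, cv=5,
                           scoring='neg_mean_squared_error')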

In the code below, param_grid tells Scikit-learn to first evaluate all 12 (3 × 4) combinations of the n_estimators and max_features hyperparameters specified in the first dict.

It then tries all 6 (2 × 3) combinations of hyperparameters specified in the second dict, this time with the bootstrap hyperparameter set to False. The grid search therefore explores 18 combinations of the RandomForestRegressor's hyperparameter values, and with cv=5 each model is trained 5 times.

This means there will be 90 rounds of training in total, so it could take a while. When the search has finished, use grid_search.best_params_ to get the best combination of hyperparameters.

from sklearn.model_selection import GridSearchCV

param_grid = [
    # try 12 (3×4) combinations of hyperparameters
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    # then try 6 (2×3) combinations with bootstrap set as False
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
  ]

forest_reg = RandomForestRegressor(random_state=42)
# train across 5 folds, that's a total of (12+6)*5=90 rounds of training 
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error',
                           return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)
GridSearchCV(cv=5, estimator=RandomForestRegressor(random_state=42),
             param_grid=[{'max_features': [2, 4, 6, 8],
                          'n_estimators': [3, 10, 30]},
                         {'bootstrap': [False], 'max_features': [2, 3, 4],
                          'n_estimators': [3, 10]}],
             return_train_score=True, scoring='neg_mean_squared_error')

What is the best combination of hyperparameters, and what is the best estimator?

  • grid_search.best_params_ for the best combination of hyperparameters found
  • grid_search.best_estimator_ for the best estimator
  • grid_search.cv_results_ for the evaluation scores of each hyperparameter combination tried

Because the best hyperparameter combination found here sits at the maximum values specified in the grid (8 and 30), you should try searching again with higher values, as the scores may continue to improve; a sketch of such a follow-up search appears after the output below.

grid_search.best_params_
{'max_features': 8, 'n_estimators': 30}
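
A sketch of such a follow-up search with larger values than the original grid (these particular values are illustrative, not taken from the text):

# follow-up grid: push n_estimators and max_features beyond 30 and 8
param_grid_larger = [
    {'n_estimators': [30, 100, 300], 'max_features': [8, 12, 16]},
]
grid_search_larger = GridSearchCV(forest_reg, param_grid_larger, cv=5,
                                  scoring='neg_mean_squared_error',
                                  return_train_score=True)
grid_search_larger.fit(housing_prepared, housing_labels)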

Get the best estimator directly

# get the best estimator directly
grid_search.best_estimator_
RandomForestRegressor(max_features=8, n_estimators=30, random_state=42)

The best solution is obtained by setting max_features to 8 and n_estimators to 30, which gives an RMSE of 49,682. This is slightly better than the 50,182 obtained with the default hyperparameter values, so the best model has now been fine-tuned.

Tip from the book: some of the data preparation steps can be treated as hyperparameters. Grid search will automatically find out whether or not to add a feature you are unsure about (for example via the 'add_bedrooms_per_room' hyperparameter of the 'CombinedAttributesAdder' transformer), and it can similarly be used to automatically find the best way to handle outliers, missing features, feature selection and so on; a sketch of this idea follows the cross-validation results below.

Evaluation scores of each hyperparameter combination tested during the grid search:

cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)
63669.11631261028 {'max_features': 2, 'n_estimators': 3}
55627.099719926795 {'max_features': 2, 'n_estimators': 10}
53384.57275149205 {'max_features': 2, 'n_estimators': 30}
60965.950449450494 {'max_features': 4, 'n_estimators': 3}
52741.04704299915 {'max_features': 4, 'n_estimators': 10}
50377.40461678399 {'max_features': 4, 'n_estimators': 30}
58663.93866579625 {'max_features': 6, 'n_estimators': 3}
52006.19873526564 {'max_features': 6, 'n_estimators': 10}
50146.51167415009 {'max_features': 6, 'n_estimators': 30}
57869.25276169646 {'max_features': 8, 'n_estimators': 3}
51711.127883959234 {'max_features': 8, 'n_estimators': 10}
49682.273345071546 {'max_features': 8, 'n_estimators': 30}
62895.06951262424 {'bootstrap': False, 'max_features': 2, 'n_estimators': 3}
54658.176157539405 {'bootstrap': False, 'max_features': 2, 'n_estimators': 10}
59470.40652318466 {'bootstrap': False, 'max_features': 3, 'n_estimators': 3}
52724.9822587892 {'bootstrap': False, 'max_features': 3, 'n_estimators': 10}
57490.5691951261 {'bootstrap': False, 'max_features': 4, 'n_estimators': 3}
51009.495668875716 {'bootstrap': False, 'max_features': 4, 'n_estimators': 10}
pd.DataFrame(grid_search.cv_results_).head()
mean_fit_time std_fit_time mean_score_time std_score_time param_max_features param_n_estimators param_bootstrap params split0_test_score split1_test_score ... mean_test_score std_test_score rank_test_score split0_train_score split1_train_score split2_train_score split3_train_score split4_train_score mean_train_score std_train_score
0 0.069634 0.005056 0.003996 0.000539 2 3 NaN {'max_features': 2, 'n_estimators': 3} -3.837622e+09 -4.147108e+09 ... -4.053756e+09 1.519591e+08 18 -1.064113e+09 -1.105142e+09 -1.116550e+09 -1.112342e+09 -1.129650e+09 -1.105559e+09 2.220402e+07
1 0.248510 0.059031 0.011722 0.003234 2 10 NaN {'max_features': 2, 'n_estimators': 10} -3.047771e+09 -3.254861e+09 ... -3.094374e+09 1.327062e+08 11 -5.927175e+08 -5.870952e+08 -5.776964e+08 -5.716332e+08 -5.802501e+08 -5.818785e+08 7.345821e+06
2 0.644345 0.008039 0.027368 0.000797 2 30 NaN {'max_features': 2, 'n_estimators': 30} -2.689185e+09 -3.021086e+09 ... -2.849913e+09 1.626875e+08 9 -4.381089e+08 -4.391272e+08 -4.371702e+08 -4.376955e+08 -4.452654e+08 -4.394734e+08 2.966320e+06
3 0.105477 0.001953 0.004034 0.000389 4 3 NaN {'max_features': 4, 'n_estimators': 3} -3.730181e+09 -3.786886e+09 ... -3.716847e+09 1.631510e+08 16 -9.865163e+08 -1.012565e+09 -9.169425e+08 -1.037400e+09 -9.707739e+08 -9.848396e+08 4.084607e+07
4 0.343328 0.003196 0.011180 0.001025 4 10 NaN {'max_features': 4, 'n_estimators': 10} -2.666283e+09 -2.784511e+09 ... -2.781618e+09 1.268607e+08 8 -5.097115e+08 -5.162820e+08 -4.962893e+08 -5.436192e+08 -5.160297e+08 -5.163863e+08 1.542862e+07

5 rows × 23 columns
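
As the tip above suggests, preparation choices can themselves be searched over once preparation and model are wrapped in a single Pipeline; parameters of named steps are reached with the step__parameter syntax. A minimal sketch, assuming the CombinedAttributesAdder transformer exposes an add_bedrooms_per_room flag (the step names here are illustrative):

from sklearn.pipeline import Pipeline

# wrap the attribute-adding step and the model so both can be tuned together
prep_and_model = Pipeline([
    ('attribs_adder', CombinedAttributesAdder()),        # assumed transformer from the preparation step
    ('forest', RandomForestRegressor(random_state=42)),
])

prep_param_grid = {
    'attribs_adder__add_bedrooms_per_room': [True, False],
    'forest__n_estimators': [30, 100],
    'forest__max_features': [6, 8],
}
prep_search = GridSearchCV(prep_and_model, prep_param_grid, cv=5,
                           scoring='neg_mean_squared_error')
# this pipeline would be fitted on the numeric data *before* the attribute-adding
# step, rather than on the already transformed housing_prepared array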

Randomised Parameter Optimisation

The grid search approach is fine when exploring relatively few combinations, but when the hyperparameter space is large it is often preferable to use RandomizedSearchCV.

If you let the randomised search run for 1,000 iterations, it will explore 1,000 different values for each hyperparameter, instead of just the few values per hyperparameter that the grid search approach considers.

While using a grid of parameter settings is currently the most widely used method for parameter optimization, other search methods have more favourable properties. RandomizedSearchCV implements a randomized search over parameters, where each setting is sampled from a distribution over possible parameter values. This has two main benefits over an exhaustive search:

  • A budget can be chosen independent of the number of parameters and possible values.
  • Adding parameters that do not influence the performance does not decrease efficiency.
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_distribs = {
    'n_estimators': randint(low=1, high=200),
    'max_features': randint(low=1, high=8),
}

forest_reg = RandomForestRegressor(random_state=42)
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5, scoring='neg_mean_squared_error', random_state=42)
rnd_search.fit(housing_prepared, housing_labels)
RandomizedSearchCV(cv=5, estimator=RandomForestRegressor(random_state=42),
                   param_distributions={'max_features': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7ff8a2af23d0>,
                                        'n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7ff8a2afcc70>},
                   random_state=42, scoring='neg_mean_squared_error')
cvres = rnd_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)
49150.70756927707 {'max_features': 7, 'n_estimators': 180}
51389.889203389284 {'max_features': 5, 'n_estimators': 15}
50796.155224308866 {'max_features': 3, 'n_estimators': 72}
50835.13360315349 {'max_features': 5, 'n_estimators': 21}
49280.9449827171 {'max_features': 7, 'n_estimators': 122}
50774.90662363929 {'max_features': 3, 'n_estimators': 75}
50682.78888164288 {'max_features': 3, 'n_estimators': 88}
49608.99608105296 {'max_features': 5, 'n_estimators': 100}
50473.61930350219 {'max_features': 3, 'n_estimators': 150}
64429.84143294435 {'max_features': 5, 'n_estimators': 2}

Ensemble methods

Another way to fine-tune your system is to combine the models that perform best. A group, or ensemble, of models often performs better than the best individual model, in much the same way that random forests perform better than the individual decision trees they are built on.
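
One way to sketch this idea is with scikit-learn's VotingRegressor, which averages the predictions of its constituent models (the second model here is purely illustrative):

from sklearn.ensemble import VotingRegressor
from sklearn.svm import SVR

# average the predictions of the tuned forest and a second, different model
voting_reg = VotingRegressor([
    ('forest', grid_search.best_estimator_),
    ('svr', SVR(C=10.0, kernel='rbf')),   # illustrative second model
])
voting_reg.fit(housing_prepared, housing_labels)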

Analyse the best models and their errors:

Inspect the best models to gain insights into the problem.

The RandomForestRegressor model can indicate the relative importance of each attribute for making accurate predictions:

feature_importances = grid_search.best_estimator_.feature_importances_
feature_importances
array([7.33442355e-02, 6.29090705e-02, 4.11437985e-02, 1.46726854e-02,
       1.41064835e-02, 1.48742809e-02, 1.42575993e-02, 3.66158981e-01,
       5.64191792e-02, 1.08792957e-01, 5.33510773e-02, 1.03114883e-02,
       1.64780994e-01, 6.02803867e-05, 1.96041560e-03, 2.85647464e-03])

Display the importance scores next to their corresponding attribute names:

extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]

cat_encoder = full_pipeline.named_transformers_["cat"]
cat_one_hot_attribs = list(cat_encoder.categories_[0])
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
sorted(zip(feature_importances, attributes), reverse=True)
[(0.36615898061813423, 'median_income'),
 (0.16478099356159054, 'INLAND'),
 (0.10879295677551575, 'pop_per_hhold'),
 (0.07334423551601243, 'longitude'),
 (0.06290907048262032, 'latitude'),
 (0.056419179181954014, 'rooms_per_hhold'),
 (0.053351077347675815, 'bedrooms_per_room'),
 (0.04114379847872964, 'housing_median_age'),
 (0.014874280890402769, 'population'),
 (0.014672685420543239, 'total_rooms'),
 (0.014257599323407808, 'households'),
 (0.014106483453584104, 'total_bedrooms'),
 (0.010311488326303788, '<1H OCEAN'),
 (0.0028564746373201584, 'NEAR OCEAN'),
 (0.0019604155994780706, 'NEAR BAY'),
 (6.0280386727366e-05, 'ISLAND')]

With this information you can identify less useful features that could be dropped: for example, only one of the ocean_proximity categories is really useful, so the others could probably be dropped.
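
A minimal sketch of keeping only the most important features based on these scores (k = 8 is an illustrative choice, and housing_prepared is assumed to be a NumPy array):

# indices of the k most important features
k = 8
top_k_idx = np.argsort(feature_importances)[-k:]
print(sorted(np.array(attributes)[top_k_idx]))

# slice the prepared training data down to those features
housing_prepared_top_k = housing_prepared[:, top_k_idx]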

Also look at the specific errors the system makes, try to understand why it makes them, and consider how the problem could be fixed: perhaps by adding extra features, getting rid of uninformative ones, cleaning up outliers, etc.

Evaluate the system on the test set:

Once you have a system that performs well enough, evaluate the final model on the test set.

  • Get the predictors and labels from the test set.
  • Run the full_pipeline to transform the data (call transform, not fit_transform, as you do not want to fit the test set).
  • Evaluate the final model on the test set.
final_model = grid_search.best_estimator_

# get the predictors and labels from the test set
X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()

# run the full pipeline to transform the data (use 'transform' and not 'fit_transform' here)
X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)
# evaluate the final model on the test set
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
final_rmse
47730.22690385927

Sometimes a point estimate of the generalisation error such as this might not be enough to convince you to use the model in production. You would like to get an idea of how precise this estimate is. You can compute a confidence interval for the generalisation error using scipy.stats.t.interval().

Compute a 95% confidence interval for the test RMSE:

from scipy import stats

confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2
np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
                         loc=squared_errors.mean(),
                         scale=stats.sem(squared_errors)))
array([45685.10470776, 49691.25001878])

The interval can also be computed manually:

m = len(squared_errors)
mean = squared_errors.mean()
tscore = stats.t.ppf((1 + confidence) / 2, df=m - 1)
tmargin = tscore * squared_errors.std(ddof=1) / np.sqrt(m)
np.sqrt(mean - tmargin), np.sqrt(mean + tmargin)
(45685.10470776, 49691.25001877858)

Alternatively, use z-scores rather than t-scores:

zscore = stats.norm.ppf((1 + confidence) / 2)
zmargin = zscore * squared_errors.std(ddof=1) / np.sqrt(m)
np.sqrt(mean - zmargin), np.sqrt(mean + zmargin)
(45685.717918136455, 49690.68623889413)

Note that if you performed a lot of hyperparameter tuning, the performance on the test set will usually be slightly worse than what you measured with cross-validation, because the system ends up fine-tuned to perform well on the validation data and is unlikely to perform quite as well on unseen data. Even so, you should not tweak the hyperparameters to make the numbers look good on the test set, as those improvements are unlikely to generalise to new data.

