
Selecting and training a model

By Angela C

October 6, 2021 in software html

Reading time: 4 minutes.

Select and train a model

After framing the problem, getting and exploring the data, sampling a training set and a test set, and writing transformation pipelines to clean up and prepare the data for machine learning algorithms automatically, the next step is to select and train a machine learning model. Thanks to all the previous steps, this will be relatively easy.
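The `full_pipeline` used below comes from those earlier preprocessing steps. As a reminder, a minimal sketch of what such a pipeline might look like (the column names and toy data here are hypothetical; the real pipeline uses the housing dataset's actual attributes):

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

num_attribs = ["longitude", "latitude"]   # hypothetical numeric columns
cat_attribs = ["ocean_proximity"]         # hypothetical categorical column

# Numeric columns: fill missing values with the median, then standardise
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# Combine numeric and categorical handling into one transformer
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])

# Tiny stand-in for the housing DataFrame
housing = pd.DataFrame({
    "longitude": [-122.2, -118.4, np.nan],
    "latitude": [37.8, 34.1, 36.5],
    "ocean_proximity": ["NEAR BAY", "INLAND", "INLAND"],
})

housing_prepared = full_pipeline.fit_transform(housing)
print(housing_prepared.shape)  # 2 scaled numeric columns + 2 one-hot columns
```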

Train a linear regression model

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)
LinearRegression()

Try it on a few instances from the training set:

Trying the full preprocessing pipeline on some training instances

# try the full preprocessing pipeline on a few training instances
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)

print("Predictions:", lin_reg.predict(some_data_prepared))
Predictions: [210816. 317904. 211040.  59112. 189832.]

Compare against the actual values:

print("Labels:", list(some_labels))
Labels: [286600.0, 340600.0, 196900.0, 46300.0, 254500.0]
some_data_prepared
array([[-1.15604281,  0.77194962,  0.74333089, -0.49323393, -0.44543821,
        -0.63621141, -0.42069842, -0.61493744, -0.31205452, -0.08649871,
         0.15531753,  1.        ,  0.        ,  0.        ,  0.        ,
         0.        ],
       [-1.17602483,  0.6596948 , -1.1653172 , -0.90896655, -1.0369278 ,
        -0.99833135, -1.02222705,  1.33645936,  0.21768338, -0.03353391,
        -0.83628902,  1.        ,  0.        ,  0.        ,  0.        ,
         0.        ],
       [ 1.18684903, -1.34218285,  0.18664186, -0.31365989, -0.15334458,
        -0.43363936, -0.0933178 , -0.5320456 , -0.46531516, -0.09240499,
         0.4222004 ,  0.        ,  0.        ,  0.        ,  0.        ,
         1.        ],
       [-0.01706767,  0.31357576, -0.29052016, -0.36276217, -0.39675594,
         0.03604096, -0.38343559, -1.04556555, -0.07966124,  0.08973561,
        -0.19645314,  0.        ,  1.        ,  0.        ,  0.        ,
         0.        ],
       [ 0.49247384, -0.65929936, -0.92673619,  1.85619316,  2.41221109,
         2.72415407,  2.57097492, -0.44143679, -0.35783383, -0.00419445,
         0.2699277 ,  1.        ,  0.        ,  0.        ,  0.        ,
         0.        ]])

Measure the regression model's RMSE

  • See sklearn.metrics, regression metrics
  • Scikit-learn's mean_squared_error function computes the mean squared error, a risk metric corresponding to the expected value of the squared (quadratic) error or loss.
  • Measure the RMSE on the whole training set.
  • On newer versions you can set squared=False to avoid having to take the square root.

Measure the regression model’s RMSE on the whole training set:

from sklearn.metrics import mean_squared_error

housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse
68628.32454669532

You can get the RMSE directly by calling the mean_squared_error() function with squared=False.

# set squared=False to avoid having to take the square root
lin_rmse = mean_squared_error(housing_labels, housing_predictions, squared=False)

lin_rmse
68628.32454669532

A prediction error of over $68,000 when the median housing values range from about $120,000 to $265,000 is not very satisfying. This is an example of the model underfitting the training data: either the model is not powerful enough or the features don't provide enough information to make good predictions. To fix underfitting, select a more powerful model or feed the algorithm better features. If the model is regularised, you can also reduce the constraints on the model.

Mean absolute error

The mean_absolute_error computes the average of all absolute differences between the target and the prediction. A related metric, median_absolute_error, is particularly interesting because it is robust to outliers: its loss is calculated by taking the median of all absolute differences between the target and the prediction.

from sklearn.metrics import mean_absolute_error

lin_mae = mean_absolute_error(housing_labels, housing_predictions)
lin_mae
49444.22728924418
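To see why median_absolute_error is more robust to outliers than mean_absolute_error, a minimal sketch on hypothetical toy values where one prediction is far off:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, median_absolute_error

# Hypothetical targets and predictions; the last prediction is a large outlier
y_true = np.array([100.0, 200.0, 300.0, 400.0, 500.0])
y_pred = np.array([110.0, 190.0, 310.0, 390.0, 900.0])

# Absolute errors are [10, 10, 10, 10, 400]
mae = mean_absolute_error(y_true, y_pred)      # pulled up by the outlier
medae = median_absolute_error(y_true, y_pred)  # unaffected by the outlier

print(mae)    # 88.0
print(medae)  # 10.0
```

The single outlier dominates the mean but leaves the median untouched.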

A Decision Tree Regressor

A DecisionTreeRegressor is a powerful model that can find complex non-linear relationships in the data.

Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. A tree can be seen as a piecewise constant approximation.

Decision trees can also be applied to regression problems, using the DecisionTreeRegressor class.

  • import from sklearn.tree
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(housing_prepared, housing_labels)
DecisionTreeRegressor(random_state=42)
housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse
0.0

This zero error above does not mean that the model is absolutely perfect. Instead it implies that the model has badly overfit the data.

Cross Validation

Note that you should not touch the test set until you are ready to launch a model you are confident in. Therefore you need to do model validation on part of the training set.

This reduces the number of samples which can be used for learning the model, and the results can depend on a particular random choice for the pair of (train, validation) sets. A solution to this problem is a procedure called cross-validation (CV for short). A test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV.

  • Could use the train_test_split function to split the training set into a smaller training set and a validation set, then train the model on the smaller training set and evaluate it on the validation set.

  • Alternatively, use K-fold cross-validation to split the training set into k folds, then train and evaluate the model k times, picking a different fold for evaluation each time and training on the other k-1 folds.

The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop.

Note that Scikit-learn's cross-validation feature expects a utility function rather than a cost function, so the scoring function is the opposite of the MSE. (With a cost function, lower is better; with a utility function, greater is better.) This is why the code below computes -scores before calculating the square root.
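A sketch of that computation with cross_val_score, shown here on a synthetic stand-in for housing_prepared and housing_labels (which come from the earlier steps):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for housing_prepared / housing_labels
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=42)

tree_reg = DecisionTreeRegressor(random_state=42)

# scoring="neg_mean_squared_error" returns the NEGATED MSE (a utility:
# greater is better), so negate the scores back before taking the square root
scores = cross_val_score(tree_reg, X, y,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)

print("Scores:", tree_rmse_scores)
print("Mean:", tree_rmse_scores.mean())
print("Std:", tree_rmse_scores.std())
```

Each of the 10 values is the RMSE on one held-out fold; their mean and standard deviation summarise the model's out-of-sample performance.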