
Some notes on Scikit-Learn

By Angela C

October 5, 2021, in software, html

Reading time: 3 minutes.

Some notes on Scikit-Learn from Chapter 2

The California Housing dataset is also available from the sklearn.datasets module.

  • The sklearn.datasets module includes utilities to load datasets, including methods to load and fetch popular reference datasets. However, the version of the dataset available there does not have the total rooms and total bedrooms features that the version used here does.

X, y = fetch_california_housing(return_X_y=True, as_frame=True) will fetch the features and the target as separate pandas objects (the features as a DataFrame, the target as a Series).

from sklearn.datasets import fetch_california_housing

# fetch_california_housing returns a Bunch, a dictionary-like object
californian_housing = fetch_california_housing(as_frame=True)
californian_housing.keys()

# pull out the DataFrame containing both the features and the target
housing = californian_housing['frame']
housing.head()

# alternatively, fetch the features and the target separately
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
  • A pipeline is a sequence of data processing components in a machine learning system. There is usually a lot of data to manipulate and transform. Each component typically takes in a large amount of data, processes it, and passes the result on to the next component in the pipeline, with each component being largely self-contained. The interface between components is the data store. (A minimal scikit-learn sketch follows.)
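
As a rough sketch of this idea in scikit-learn, a Pipeline chains transformers in sequence; the step names and the choice of SimpleImputer and StandardScaler here are illustrative assumptions, not something prescribed by the book.

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# each step processes the data and passes the result to the next step
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),  # fill in missing values
    ('scaler', StandardScaler()),                   # standardise the features
])

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0]])  # toy data
X_prepared = num_pipeline.fit_transform(X)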

Selecting Performance Measures

  • The Root Mean Square Error (RMSE) is suitable for regression and indicates how much error the system makes in its predictions. The RMSE gives a higher weight to large errors.

$RMSE(X,h)$ is the cost function measured on the set of training examples using the hypothesis $h$ where the hypothesis is the system’s prediction function.
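
For reference, over a training set of $m$ examples the RMSE can be written as

$$RMSE(X,h) = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(h(x^{(i)}) - y^{(i)}\right)^{2}}$$

where $x^{(i)}$ is the feature vector of the $i$-th example and $y^{(i)}$ is its label.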

  • The Mean Absolute Error (MAE) is another performance measurement (also known as average absolute deviation).
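
In the same notation, the MAE is

$$MAE(X,h) = \frac{1}{m}\sum_{i=1}^{m}\left|h(x^{(i)}) - y^{(i)}\right|$$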

The $RMSE$ and $MAE$ are both ways to measure the distance between the vector of predictions and the vector of target values.

The $RMSE$ corresponds to the Euclidean norm, also called the $l_2$ norm, written $||\cdot||_2$ or simply $||\cdot||$. The $MAE$ corresponds to the Manhattan norm or $l_1$ norm. (It is known as the Manhattan norm because it measures the distance between two points in a city where you can only travel along orthogonal city blocks.) See Chapter 2, which provides the equations behind this.

The RMSE is more sensitive to outliers than the MAE, but when outliers are exponentially rare (as with a bell-shaped distribution) the RMSE performs very well and is generally preferred. Both can be computed with scikit-learn’s metrics module, as sketched below.
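
A minimal sketch of computing both measures; the arrays y_true and y_pred below are made-up toy values.

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

y_true = np.array([3.0, 2.5, 4.0, 7.0])   # toy targets
y_pred = np.array([2.8, 3.0, 3.5, 8.0])   # toy predictions

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # RMSE = square root of the MSE
mae = mean_absolute_error(y_true, y_pred)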


About Scikit-learn’s API

(based on notes in the book.)

All objects share a consistent and simple interface

Estimators: Any object that can estimate some parameters based on a dataset (such as an imputer).

  • the estimation is performed by the fit() method, which takes only a dataset as a parameter (or two for supervised learning algorithms, where the second dataset contains the labels)
  • any other parameters are considered hyperparameters (such as an imputer’s strategy) and must be set as an instance variable, generally via a constructor parameter (see the small example below).
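
A small sketch of the estimator interface using SimpleImputer (the current name of scikit-learn’s imputer); the toy array is made up for illustration.

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0]])  # toy data with a missing value

imputer = SimpleImputer(strategy='median')  # 'strategy' is a hyperparameter, set in the constructor
imputer.fit(X)                              # fit() estimates the parameters (here, the column medians)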

Transformers: Some estimators can also transform a dataset.

  • the transformation is performed by the transform() method with the dataset as a parameter
  • returns a transformed dataset based on the learned parameters
  • the convenience method fit_transform() is equivalent to calling fit() and then transform(), and is sometimes optimized to run much faster (see the example below).
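
A sketch of the transformer interface, here with StandardScaler on a toy array.

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

scaler = StandardScaler()
X_scaled = scaler.fit(X).transform(X)  # fit() then transform()
X_scaled_2 = scaler.fit_transform(X)   # fit_transform() does both in one call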

Predictors: Some estimators can make predictions on a dataset.

  • predict() method takes a dataset of new instances and returns a dataset of corresponding predictions.
  • the score() method measures the quality of the predictions, given a test set (and the corresponding labels, in the case of supervised learning algorithms); a short example follows.
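
A sketch of the predictor interface with LinearRegression on made-up data.

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

model = LinearRegression()
model.fit(X, y)
predictions = model.predict(np.array([[5.0]]))  # predictions for new instances
r2 = model.score(X, y)                          # R^2 of the predictions on a given set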

Inspection: All the estimator’s hyperparameters are accessible directly via public instance variables (such as imputer.strategy), and all its learned parameters are accessible via public instance variables with an underscore suffix (such as imputer.statistics_).
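
Continuing the imputer sketch, inspection looks like this (toy data again):

import numpy as np
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='median')
imputer.fit(np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0]]))

imputer.strategy     # hyperparameter, accessible directly
imputer.statistics_  # learned parameters, underscore suffix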

Non-proliferation of classes:

  • Datasets are represented as NumPy arrays or SciPy sparse matrices instead of homemade classes.
  • Hyperparameters are just regular Python strings or numbers.

Composition: Existing building blocks are reused as much as possible.

Sensible defaults: Scikit-learn provides reasonable default values for most parameters.