Feature Scaling – Machine Learning Notes

Feature Scaling, also known as Data Normalisation, is a data preprocessing technique used in Machine Learning to normalise the range of predictor variables (i.e. independent variables, or features).

This is done to ensure that all the input variables have values on a normalised range. Since ranges of values can be widely different, and many Machine Learning algorithms use some notion of distance between data points, features with broader ranges will have a stronger impact on the computation of such distance.

By scaling the features into a normalised range, their contribution to the final result will be about the same.
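As a rough illustration of this effect, the sketch below computes the Euclidean distance between two made-up data points described by age and income. The feature names, the values and the assumed ranges are all hypothetical, chosen only to show how the broader-range feature dominates the distance until both features are rescaled.

```python
import math

# Two hypothetical data points: (age in years, annual income in dollars)
a = (25, 50_000)
b = (60, 51_000)

# Raw distance: the income difference (1,000) swamps the age difference (35)
raw_distance = math.dist(a, b)

# Assumed feature ranges, used here for a simple rescaling to [0, 1]
age_min, age_max = 18, 90
income_min, income_max = 10_000, 200_000

def rescale(point):
    """Map each feature of a point into [0, 1] using the assumed ranges."""
    age, income = point
    return (
        (age - age_min) / (age_max - age_min),
        (income - income_min) / (income_max - income_min),
    )

scaled_distance = math.dist(rescale(a), rescale(b))

print(f"raw distance:    {raw_distance:.2f}")
print(f"scaled distance: {scaled_distance:.4f}")
```

Before rescaling, the distance is almost entirely the income gap. After rescaling, the large age gap is what drives the distance.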

There are several methods to perform feature scaling; common examples include Data Standardisation and Min-Max Normalisation.

Data Standardisation

Each predictor variable is transformed by subtracting its mean and dividing by its standard deviation. The resulting distribution is centred at zero and has unit variance.

x^{\prime} = \frac{x - x_{mean}}{\sigma_{x}}
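The formula can be sketched in plain Python as follows. The helper name `standardise` and the sample data are hypothetical, and the sketch assumes the population standard deviation (dividing by n), which is also what scikit-learn's StandardScaler uses.

```python
import statistics

def standardise(values):
    """Return z-scores: subtract the mean, divide by the standard deviation."""
    mean = statistics.fmean(values)
    # Population standard deviation (divides by n), an assumption of this sketch
    std = statistics.pstdev(values)
    return [(x - mean) / std for x in values]

heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]  # hypothetical data
z = standardise(heights_cm)

print(z)
# The transformed values have (approximately) zero mean and unit variance
print(statistics.fmean(z), statistics.pstdev(z))
```

Note that a constant feature has zero standard deviation, so this sketch would raise a ZeroDivisionError for it; real code needs to handle that case.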

Min-Max Normalisation

Also called rescaling, this method maps the transformed values into the [0, 1] range. Each predictor variable is transformed by subtracting its minimum value and dividing by the difference between its maximum and minimum values.

x^{\prime} = \frac{x - x_{min}}{x_{max} - x_{min}}

The min-max normalisation approach can be generalised to produce transformed variables with values in any [a, b] range, using the following formula:

x^{\prime} = a + \frac{(x - x_{min})(b - a)}{x_{max} - x_{min}}
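Both min-max formulas can be covered by one small function, since the [0, 1] case is the general formula with a = 0 and b = 1. This is a minimal sketch: the function name `min_max_scale` and the sample data are hypothetical.

```python
def min_max_scale(values, a=0.0, b=1.0):
    """Rescale values linearly so the minimum maps to a and the maximum to b."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        raise ValueError("cannot rescale a constant feature")
    return [a + (x - x_min) * (b - a) / (x_max - x_min) for x in values]

temperatures = [10.0, 15.0, 20.0, 30.0]  # hypothetical data

unit_range = min_max_scale(temperatures)            # default [0, 1] range
symmetric = min_max_scale(temperatures, -1.0, 1.0)  # generalised [a, b] range

print(unit_range)  # [0.0, 0.25, 0.5, 1.0]
print(symmetric)   # [-1.0, -0.5, 0.0, 1.0]
```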

Do all Machine Learning algorithms need feature scaling?

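Algorithms based on distance/similarity and curve fitting (kNN, SVM, Neural Networks, Linear/Logistic Regression) generally require scaling. For Linear/Logistic Regression this matters mainly when the model is regularised or fitted with gradient descent; plain least squares gives the same predictions with or without scaling.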

Tree-based algorithms (Random Forest, XGBoost) and Naive Bayes don’t require scaling.
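One way to see why tree-based models are insensitive to scaling: a split only compares a feature against a threshold, so any order-preserving rescaling partitions the samples in the same way. The sketch below uses a toy one-feature decision stump to show this. It is not how Random Forest or XGBoost are implemented, and the function name and data are hypothetical (feature values are assumed to be distinct).

```python
def best_stump_partition(xs, ys):
    """Find the threshold split on one feature with the fewest errors.

    Returns the set of sample indices sent to the left branch.
    Assumes binary labels (0/1) and distinct feature values.
    """
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best_errors, best_left = None, None
    # Try a split after each position in the sorted order
    for cut in range(1, len(order)):
        left, right = order[:cut], order[cut:]
        # Errors if the left branch predicts 0 and the right branch predicts 1
        errors_01 = sum(ys[i] != 0 for i in left) + sum(ys[i] != 1 for i in right)
        # The opposite labelling makes exactly the complementary errors
        errors = min(errors_01, len(xs) - errors_01)
        if best_errors is None or errors < best_errors:
            best_errors, best_left = errors, set(left)
    return best_left

incomes = [80_000, 20_000, 120_000, 35_000, 50_000]  # hypothetical feature
labels = [1, 0, 1, 0, 0]                             # hypothetical classes

# Min-max rescaling to [0, 1] preserves the ordering of the values
lo, hi = min(incomes), max(incomes)
scaled_incomes = [(x - lo) / (hi - lo) for x in incomes]

raw_left = best_stump_partition(incomes, labels)
scaled_left = best_stump_partition(scaled_incomes, labels)

print(raw_left == scaled_left)  # True
```

The threshold value itself changes with the scale, but the samples that fall on each side of it do not.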

Scaling training/test data sets correctly

When scaling a dataset that will be used for supervised learning with a train/test split, we need to re-use the training parameters to transform the test data set. By “training parameters” in this context we mean the relevant statistics, such as the mean and standard deviation for data standardisation (or the minimum and maximum for min-max normalisation).

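Why do we need to compute these statistics on the training set only? When using a trained model to make predictions, the test data should be “new and unseen”, i.e. not available at the time the model is built. Computing the statistics on the full data set would leak information about the test data into the model.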

In Python/scikit-learn, this translates roughly to the following:

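from sklearn.preprocessing import StandardScaler

# train_data and test_data are assumed to be 2D arrays (samples x features)
scaler = StandardScaler()

# Learn the mean and standard deviation from the training data, then scale it
scaled_train_data = scaler.fit_transform(train_data)
# Scale the test data with the statistics learned from the training data
scaled_test_data = scaler.transform(test_data)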

The first call, fit_transform(), computes the mean and standard deviation on the training data and uses them to scale it, while the second call, transform(), re-uses those statistics to transform the test data.
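To make the mechanics explicit, here is a rough plain-Python equivalent for a single feature. The `fit` and `transform` helpers and the data are hypothetical stand-ins that only mirror the idea behind the scikit-learn calls; they are not the library's implementation.

```python
import statistics

def fit(train_values):
    """Compute the scaling statistics on the training data only."""
    return statistics.fmean(train_values), statistics.pstdev(train_values)

def transform(values, mean, std):
    """Apply previously computed statistics to any data set."""
    return [(x - mean) / std for x in values]

train = [2.0, 4.0, 6.0, 8.0]  # hypothetical training feature
test = [5.0, 10.0]            # hypothetical test feature

mean, std = fit(train)                    # statistics come from train only
scaled_train = transform(train, mean, std)
scaled_test = transform(test, mean, std)  # re-use the training statistics

print(scaled_train)
print(scaled_test)
```

The scaled test data is not guaranteed to have zero mean or unit variance, and that is expected: it is expressed in the units learned from the training set.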

Video Course: Practical Python Data Science Techniques

I’m happy to announce the recent release of my second video course, Practical Python Data Science Techniques, published with Packt Publishing.

Links:

This video course follows my first introductory course (Data Analysis with Python) and provides the audience with recipe-like solutions to common Data Science problems.

In particular, with about 2.5 hours of material, the video course covers the following topics:

  1. Exploring Your Data
    This section covers some of the most common techniques related to loading data, performing exploratory analysis and cleaning your data to get them in the right shape.
  2. Dealing with Text
    This section describes the common pre-processing techniques that you need to deal with text, from tokenisation to normalisation, to calculating word frequencies.
  3. Machine Learning Problems
    This section describes the most common Machine Learning problems and how to tackle them using scikit-learn.
  4. Time Series and Recommender Systems
    The last section groups some miscellaneous topics, in particular Time Series Analysis and the basics of implementing a recommender system.

More details about the content of the course are available on PacktPub’s page, and of course you can check out the code examples on my GitHub (links at the top of this page).

If you are a beginner, you may also be interested in my other video course, Data Analysis with Python (see video course on PacktPub.com, course material on GitHub and course overview on this blog).

@MarcoBonzanini