Feature Scaling – Machine Learning Notes

Feature Scaling, also known as Data Normalisation, is a data preprocessing technique used in Machine Learning to normalise the range of predictor variables (i.e. independent variables, or features).

This is done to ensure that all input variables take values in a comparable range. Since the ranges of raw features can differ widely, and many Machine Learning algorithms use some notion of distance between data points, features with broader ranges would otherwise dominate the computation of that distance.

By scaling the features into a normalised range, their contributions to the final result become roughly comparable.
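
As a quick illustration (a minimal sketch with made-up numbers), consider two features measured on very different scales, such as age in years and income in euros: the Euclidean distance between data points is driven almost entirely by the broader-range feature until both are standardised.

import numpy as np

# Toy data: column 0 is age (years), column 1 is income (euros)
X = np.array([[25.0, 50000.0],
              [50.0, 52000.0],
              [35.0, 80000.0]])

# The raw distance between the first two points is dominated by income
print(np.linalg.norm(X[0] - X[1]))  # ~2000.16

# After standardising each column, both features contribute to the distance
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.linalg.norm(X_scaled[0] - X_scaled[1]))  # ~2.44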

There are several methods to perform feature scaling; two common examples are Data Standardisation and Min-Max Normalisation.

Data Standardisation

Each predictor variable is transformed by subtracting its mean and dividing by its standard deviation. The resulting distribution is centred at zero and has unit variance.

x^{\prime} = \frac{x - x_{mean}}{\sigma_{x}}
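
As a minimal sketch of this formula in plain NumPy (the array below is made up for illustration):

import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Subtract the mean, divide by the standard deviation
x_scaled = (x - x.mean()) / x.std()

print(x_scaled.mean())  # ~0.0
print(x_scaled.std())   # 1.0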

Min-Max Normalisation

Also called rescaling, this method transforms values into the [0, 1] range. Each predictor variable is transformed by subtracting its minimum value and dividing by the difference between its maximum and minimum values.

x^{\prime} = \frac{x - x_{min}}{x_{max} - x_{min}}

The min-max normalisation approach can be generalised to produce transformed variables with values in any [a, b] range, using the following formula:

x^{\prime} = a + \frac{(x - x_{min})(b - a)}{x_{max} - x_{min}}
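
The two formulas above translate directly into NumPy; the following sketch (with a made-up example array) covers both the default [0, 1] case and an arbitrary [a, b] range:

import numpy as np

def min_max_scale(x, a=0.0, b=1.0):
    # Rescale x linearly into the [a, b] range
    return a + (x - x.min()) * (b - a) / (x.max() - x.min())

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

print(min_max_scale(x))             # [0.   0.25 0.5  0.75 1.  ]
print(min_max_scale(x, a=-1, b=1))  # [-1.  -0.5  0.   0.5  1. ]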

Do all Machine Learning algorithms need feature scaling?

Algorithms based on distance/similarity or on curve fitting generally require scaling: examples include kNN, SVM, Neural Networks, and Linear/Logistic Regression (particularly when trained with gradient descent or with regularisation).

Tree-based algorithms (e.g. Random Forest, XGBoost) and Naive Bayes don’t require scaling: tree splits depend only on the ordering of feature values, which scaling preserves, and Naive Bayes treats each feature independently.

Scaling training/test data sets correctly

When scaling a dataset that is going to be used for supervised learning with a train/test split, we need to re-use the training parameters to transform the test data set. By “training parameters” we mean, in this context, the relevant statistics for data normalisation, such as the mean and standard deviation.

Why do we need to compute these statistics on the training set only? When using a trained model to make predictions, the test data should be “new and unseen”, i.e. not available at the time the model is built. Computing the statistics on the full dataset would leak information about the test set into the training process.

In Python/scikit-learn, this translates roughly to the following:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Compute the training mean and standard deviation, and scale the training data
scaled_train_data = scaler.fit_transform(train_data)
# Re-use the training statistics to scale the test data
scaled_test_data = scaler.transform(test_data)

The fit_transform() call computes the mean and standard deviation of each feature on the training data and scales the training data in one step, while transform() re-uses those statistics to scale the test data.
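
For illustration only, here is roughly what happens under the hood, written in plain NumPy (the toy arrays are my own): the statistics come from the training data alone and are then applied, unchanged, to the test data.

import numpy as np

train_data = np.array([[1.0, 100.0],
                       [2.0, 200.0],
                       [3.0, 300.0]])
test_data = np.array([[1.5, 150.0]])

# Statistics are computed on the training data only
mean = train_data.mean(axis=0)
std = train_data.std(axis=0)

scaled_train_data = (train_data - mean) / std
# The same training statistics are re-used on the test data
scaled_test_data = (test_data - mean) / std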

Video Course: Data Analysis with Python

I’m happy to announce the release of my first video course Data Analysis with Python, published with Packt Publishing.

With 2 hours and 26 minutes of content segmented into short video sessions, this course aims to introduce the audience to the field of Data Science using Python, discussing some of the fundamental tools of the trade.

Bird’s-eye view of the course:

  1. Python Core
    • Course overview
    • Python Core Concepts and Data Types
    • Understanding Iterables
    • List Comprehensions
    • Dates and Times
    • Accessing Raw Data
  2. NumPy for Array Computation
    • Creating NumPy Arrays
    • Basic Stats and Linear Algebra
    • Reshaping, Indexing, and Slicing
  3. Pandas for Data Frames
    • Getting Started with Pandas
    • Essential Operations with Data Frames
    • Summary Statistics from a Data Frame
    • Data Aggregation over a Data Frame
  4. Exercise: Titanic Survivor Analysis
    • Exploratory Analysis of the Titanic Disaster Data Set
    • Predicting Titanic Survivor as a Supervised Learning Problem
    • Performing Supervised Learning with scikit-learn

More details are available on PacktPub’s page for the course.

Please have a look at the companion code for the course on my GitHub page to get an idea of the topics discussed.