# Data Scaling for Machine Learning — The Essential Guide

What is Standardization and why is it so darn important?

It’s possible that you will come across datasets with lots of numerical noise built-in, such as variance or differently-scaled data, so a good preprocessing is a must before even thinking about machine learning. A good preprocessing solution for this type of problem is often referred to as standardization.

Standardization is a preprocessing method used to transform continuous data to make it look normally distributed. In `scikit-learn` this is often a necessary step because many models assume that the data you are training on is normally distributed, and if it isn’t, your risk biasing your model.

You can standardize your data in different ways, and in this article, we’re going to talk about the popular data scaling method —data scaling. Or standard scaling to be more precise.

It’s also important to note that standardization is a preprocessing method applied to continuous, numerical data, and there are a few different scenarios in which you want to use it:

1. When working with any kind of model that uses a linear distance metric or operates on a linear space — KNN, linear regression, K-means
2. When a feature or features in your dataset have high variance — this could bias a model that assumes the data is normally distributed, if a feature in has a variance that’s an order of magnitude or greater than other features

Let’s now proceed with the data scaling.

## Data scaling

Scaling is a method of standardization that’s most useful when working with a dataset that contains continuous features that are on different scales, and you’re using a model that operates in some sort of linear space (like linear regression or K-nearest neighbors)

Feature scaling transforms the features in your dataset so they have a mean of zero and a variance of one This will make it easier to linearly compare features. Also, this a requirement for many models in `scikit-learn`.

Let’s take a look at a dataset called wine:

``````import pandas as pd
import numpy as np
from sklearn import datasets
wine = pd.DataFrame(
data=np.c_[wine['data'], wine['target']],
columns=wine['feature_names'] + ['target']
)
``````

We want to use the `ash`, `alcalinity_of_ash`, and `magnesium` columns in the wine dataset to train a linear model, but it’s possible that these columns are all measured in different ways, which would bias a linear model. Using the `describe()` function returns descriptive statistics about the dataset:

``````wine[['magnesium', 'ash', 'alcalinity_of_ash']].describe()
``````

We can see that the max of `ash` is 3.23, max of `alcalinity_of_ash` is 30, and a max of `magnesium` is 162. There are huge differences between the values, and a machine learning model could here easily interpret `magnesium` as the most important attribute, due to larger scale.

Let’s standardize them in a way that allows for the use in a linear model. Here are the steps:

1. Import `StandardScaler` and create an instance of it
2. Create a subset on which scaling is performed
3. Apply the scaler fo the subset

Here’s the code:

``````from sklearn.preprocessing import StandardScaler

# create the scaler
ss = StandardScaler()

# take a subset of the dataframe you want to scale
wine_subset = wine[['magnesium', 'ash', 'alcalinity_of_ash']]

# apply the scaler to the dataframe subset
wine_subset_scaled = ss.fit_transform(wine_subset)
``````

Awesome! Let’s see how the first couple of rows of scaled data look like:

The values are now much closer together. To see how scaling actually impacts the model’s predictive power, let’s make a quick KNN model.

First, with the non-scaled data:

``````from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

X = wine.drop('target', axis=1)
y = wine['target']

X_train, X_test, y_train, y_test = train_test_split(X, y)

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

print(knn.score(X_test, y_test))
>>> 0.666666666666
``````

Not so good of an accuracy. Let’s scale the entire dataset and repeat the process:

``````ss = StandardScaler()

X_scaled = ss.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

print(knn.score(X_test, y_test))
>>> 0.97777777777777777
``````

As you can see, the accuracy of our model increased significantly. I’ll leave further tweaking of this KNN classifier up to you, and who knows, maybe you can get all the classifications correctly.

Let’s wrap things up in the next section.

## Before you go

That’s pretty much it for data standardization and why it is important. We’ll compare `StandardScaler` with other scalers some other time. The take-home point of this article is that you should use `StandardScaler`whenever you need normally distributed (relatively) features.

To be more precise, use `StandardScaler` whenever you’re using a model that assumes that the data is normally distributed - such as KNN or linear regression.