Feature Scaling Normalization Standardization

 

Feature Scaling Normalization Standardization in Data Mining

In this article, I will discuss three feature scaling techniques:

  • Normalization – Standardization (Z-score scaling)
  • Min-Max scaling
  • Robust scaling

The dataset used in this article for demonstration is the titanic.csv file.

First, we import the required libraries: pandas, NumPy, os, and train_test_split from sklearn.model_selection.

import pandas as pd
import numpy as np
import os
from sklearn.model_selection import train_test_split

Next, we use the read_csv() function from the pandas library to read the dataset. We are interested in only a few columns, so we list them in use_cols and pass that list to the usecols parameter.

use_cols = ['Pclass', 'Sex', 'Age', 'Fare', 'SibSp', 'Survived']

data = pd.read_csv('./data/titanic.csv', usecols=use_cols)

print(data.shape)

data.head(3)

The data.head(3) call displays the first three rows, which confirms that the dataset was read successfully. The shape attribute gives the dimensions of the dataset. In this case the shape is (891, 6), which means the dataset has 891 rows and 6 columns.

   Survived  Pclass     Sex   Age  SibSp     Fare
0         0       3    male  22.0      1   7.2500
1         1       1  female  38.0      1  71.2833
2         1       3  female  26.0      0   7.9250
Data set

Note that the full DataFrame is passed as both arguments to train_test_split, so the target variable Survived remains a column of X_train. This is not the standard way of using train_test_split.

X_train, X_test, y_train, y_test = train_test_split(data, data, test_size=0.3, random_state=0)
X_train.shape, X_test.shape

Output:

((623, 6), (268, 6))
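
For comparison, the conventional call separates the features from the target before splitting. The following is a minimal sketch on a small hypothetical DataFrame (the name toy and its values are made up for illustration and are not taken from titanic.csv):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for a few of the Titanic columns
toy = pd.DataFrame({
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05, 8.46, 51.86, 21.08, 11.13, 30.07],
    "Age": [22, 38, 26, 35, 35, 27, 54, 2, 27, 14],
    "Survived": [0, 1, 1, 1, 0, 0, 0, 0, 1, 1],
})

X = toy.drop(columns="Survived")   # features only
y = toy["Survived"]                # target only

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

With this form the target never appears among the feature columns, so it cannot be scaled or used as an input by accident.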

Normalization – Standardization (Z-score scaling)

First, we check whether the data is already standardized. If the mean is 0 and the standard deviation is 1, the feature is already standardized and there is no need to do feature scaling.

print(X_train['Fare'].mean())
print(X_train['Fare'].std())

Output:
32.458272552166925
48.257658284816124

The mean and standard deviation of Fare are far from 0 and 1, so the feature is not standardized. Z-score scaling is performed using the formula below.

z = (X – X.mean) / std
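
As a rough illustration, the formula can be applied by hand with NumPy. The fare values below are hypothetical and chosen only to keep the arithmetic small:

```python
import numpy as np

# Hypothetical fare values, for illustration only
fare = np.array([7.25, 71.28, 7.92, 53.10, 8.05])

mean = fare.mean()
std = fare.std()          # population standard deviation (ddof=0)
z = (fare - mean) / std   # z = (X - X.mean) / std

print(z.round(3))
print(z.mean(), z.std())  # approximately 0 and 1
```

After scaling, the values are expressed as the number of standard deviations they lie above or below the mean.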

# add the newly created feature
from sklearn.preprocessing import StandardScaler
ss = StandardScaler().fit(X_train[['Fare']])
X_train_copy = X_train.copy(deep=True)
X_train_copy['Fare_zscore'] = ss.transform(X_train_copy[['Fare']])
print(X_train_copy.head(6))

Output:
     Survived  Pclass     Sex   Age  SibSp     Fare  Fare_zscore
857         1       1    male  51.0      0  26.5500    -0.122530
52          1       1  female  49.0      1  76.7292     0.918124
386         0       3    male   1.0      5  46.9000     0.299503
124         0       1    male  54.0      0  77.2875     0.929702
578         0       3  female   NaN      1  14.4583    -0.373297
549         1       2    male   8.0      1  36.7500     0.089005

Now we find the mean and standard deviation of the scaled feature to confirm that they are approximately 0 and 1.

print(X_train_copy['Fare_zscore'].mean())
print(X_train_copy['Fare_zscore'].std())

Output:
5.916437306188636e-17
1.0008035356861
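
The standard deviation printed above is slightly greater than 1. A likely explanation is that StandardScaler divides by the population standard deviation (ddof=0), while the pandas std() method reports the sample standard deviation (ddof=1) by default. The two differ by a factor of sqrt(n / (n - 1)). The sketch below checks this with NumPy on hypothetical random data of the same size as the training set (623 rows); the data itself is an assumption, only the row count comes from the article:

```python
import numpy as np

n = 623  # number of training rows in this article

# Hypothetical data; any non-constant sample of size n gives the same ratio
rng = np.random.default_rng(0)
x = rng.exponential(scale=30.0, size=n)

# Scale with the population standard deviation (ddof=0)
z = (x - x.mean()) / x.std(ddof=0)

pop_std = z.std(ddof=0)            # population std of the scaled values
sample_std = z.std(ddof=1)         # sample std, the pandas default
expected = np.sqrt(n / (n - 1))    # ratio between the two estimators

print(pop_std, sample_std, expected)
```

Calling X_train_copy['Fare_zscore'].std(ddof=0) in the article's code should therefore give a value much closer to 1.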

Min-Max scaling

Min-Max scaling rescales the values to the range [0, 1] using the formula below.

X_scaled = (X – X.min) / (X.max – X.min)
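
A minimal hand-written version of this formula is sketched below. The arrays train_fare and test_fare hold hypothetical values; the point is that the minimum and maximum come from the training data only:

```python
import numpy as np

# Hypothetical training and test fares, for illustration only
train_fare = np.array([7.25, 71.28, 7.92, 53.10, 8.05])
test_fare = np.array([5.00, 30.00, 100.00])

lo, hi = train_fare.min(), train_fare.max()   # learned from training data only

train_scaled = (train_fare - lo) / (hi - lo)
test_scaled = (test_fare - lo) / (hi - lo)

print(train_scaled.round(3))   # all values lie in [0, 1]
print(test_scaled.round(3))    # values outside the training range fall outside [0, 1]
```

Unseen values below the training minimum or above the training maximum map outside [0, 1], which is worth keeping in mind when a model expects bounded inputs.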

from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler().fit(X_train[['Fare']])
X_train_copy = X_train.copy(deep=True)
X_train_copy['Fare_minmax'] = mms.transform(X_train_copy[['Fare']])
print(X_train_copy.head(6))

Output:
     Survived  Pclass     Sex   Age  SibSp     Fare  Fare_minmax
857         1       1    male  51.0      0  26.5500     0.051822
52          1       1  female  49.0      1  76.7292     0.149765
386         0       3    male   1.0      5  46.9000     0.091543
124         0       1    male  54.0      0  77.2875     0.150855
578         0       3  female   NaN      1  14.4583     0.028221
549         1       2    male   8.0      1  36.7500     0.071731

Robust scaling

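Robust scaling subtracts the median and divides by a quantile range, which defaults to the interquartile range (IQR). Because the median and IQR are barely affected by extreme values, this method is less sensitive to outliers.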

X_scaled = (X – X.median) / IQR
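
The sketch below applies this formula by hand and compares it with Min-Max scaling on the same data. The fare values are hypothetical, with one deliberately extreme value; the 25th and 75th percentiles are used for the IQR, mirroring the default quantile range:

```python
import numpy as np

# Hypothetical fares with one extreme value (512.33) acting as an outlier
fare = np.array([7.25, 7.92, 8.05, 13.00, 26.55, 53.10, 71.28, 512.33])

median = np.median(fare)
q1, q3 = np.percentile(fare, [25, 75])
iqr = q3 - q1

robust = (fare - median) / iqr                          # X_scaled = (X - X.median) / IQR
minmax = (fare - fare.min()) / (fare.max() - fare.min())

print(robust.round(3))
print(minmax.round(3))
```

In the Min-Max result the outlier pushes every other value into a narrow band near 0, whereas the robust result keeps the typical values spread out around 0.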

from sklearn.preprocessing import RobustScaler
rs = RobustScaler().fit(X_train[['Fare']])
X_train_copy = X_train.copy(deep=True)
X_train_copy['Fare_robust'] = rs.transform(X_train_copy[['Fare']])
print(X_train_copy.head(6))

Output:
     Survived  Pclass     Sex   Age  SibSp     Fare  Fare_robust
857         1       1    male  51.0      0  26.5500     0.492275
52          1       1  female  49.0      1  76.7292     2.630973
386         0       3    male   1.0      5  46.9000     1.359616
124         0       1    male  54.0      0  77.2875     2.654768
578         0       3  female   NaN      1  14.4583    -0.023088
549         1       2    male   8.0      1  36.7500     0.927011
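
All three scalers in this article are fitted on the training data. The same fitted scaler can then be reused on the test data so that both sets share the training statistics. A minimal sketch, assuming small hypothetical arrays named train and test in place of the Fare column:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Hypothetical single-feature arrays shaped (n_samples, 1), as scikit-learn expects
train = np.array([[7.25], [71.28], [7.92], [53.10], [8.05]])
test = np.array([[30.00], [100.00]])

results = {}
for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    scaler.fit(train)                       # statistics come from the training data only
    scaled_test = scaler.transform(test)    # reuse those statistics on the test data
    results[type(scaler).__name__] = scaled_test.ravel()
    print(type(scaler).__name__, scaled_test.ravel().round(3))
```

Fitting a separate scaler on the test set would apply different statistics to the two sets and leak information about the test data into preprocessing.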

Summary

This article introduced three feature scaling techniques: standardization (Z-score scaling), Min-Max scaling, and robust scaling.
