Feature Scaling Normalization Standardization

Feature Scaling Normalization Standardization in Data Mining

In this article, I will discuss:

Normalization – Standardization (Z-score scaling)
Min-Max scaling
Robust scaling

The titanic.csv dataset is used in this article for demonstration.

First, we will import the required libraries like pandas, NumPy, os, and train_test_split from sklearn.model_selection.

import pandas as pd
import numpy as np
import os
from sklearn.model_selection import train_test_split

Next, we use the read_csv() function from the pandas library to read the dataset. Only a few columns are needed, so the list use_cols holds the required column names and is passed to the usecols parameter.

use_cols = ['Pclass', 'Sex', 'Age', 'Fare', 'SibSp', 'Survived']

data = pd.read_csv('./data/titanic.csv', usecols=use_cols)

print(data.shape)

data.head(3)

The data.head(3) call displays the first three rows to confirm that the dataset was read successfully, and the shape attribute gives its dimensions. Here the shape is (891, 6), which means the dataset has 891 rows and 6 columns.

	Survived	Pclass	Sex	Age	SibSp	Fare
0	0	3	male	22.0	1	7.2500
1	1	1	female	38.0	1	71.2833
2	1	3	female	26.0	0	7.9250
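
The snippets in the sections that follow use X_train, which the code above never defines. Here is a minimal sketch of the missing split. It assumes the whole frame is split with Survived as the target, test_size=0.3, and random_state=0. The 70/30 ratio is consistent with the standard deviation of about 1.0008 reported later for the scaled feature, which matches a training set of 623 rows. The random_state is purely an assumption. The small DataFrame holds made-up values so the block runs without titanic.csv.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in for the Titanic frame (illustrative values only);
# in the article, `data` comes from pd.read_csv(...).
data = pd.DataFrame({
    'Survived': [0, 1, 1, 0, 1, 0, 0, 1, 1, 0],
    'Pclass':   [3, 1, 3, 1, 3, 3, 1, 3, 2, 3],
    'Fare':     [7.0, 70.0, 8.0, 50.0, 9.0, 10.0, 55.0, 20.0, 12.0, 30.0],
})

# Assumed settings: test_size=0.3 and random_state=0 are not stated in the article.
X_train, X_test, y_train, y_test = train_test_split(
    data, data['Survived'], test_size=0.3, random_state=0
)

# X_train keeps every column (including Survived), as in the article's outputs.
print(X_train.shape, X_test.shape)
```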

Normalization – Standardization (Z-score scaling)

First, we check whether the data is already standardized. If a feature has mean 0 and standard deviation 1, it is already standardized and needs no further scaling. Here we check the Fare column of the training set X_train.

print(X_train['Fare'].mean())
print(X_train['Fare'].std())

Output:
32.458272552166925
48.257658284816124

The mean and standard deviation are far from 0 and 1, so Fare is not standardized. Z-score scaling is performed using the formula below.

z = (X - X.mean) / X.std
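
To see what the formula does, it can be applied by hand and compared with scikit-learn's StandardScaler. This is a sketch on made-up fare values, not the Titanic data, and the variable names are illustrative. StandardScaler divides by the population standard deviation, so the manual version passes ddof=0.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up fares standing in for X_train['Fare'].
fares = pd.DataFrame({'Fare': [7.25, 8.05, 13.0, 26.55, 35.5, 71.28, 263.0]})

# z = (X - mean) / std, with the population standard deviation (ddof=0).
manual = (fares['Fare'] - fares['Fare'].mean()) / fares['Fare'].std(ddof=0)

# The same transformation through scikit-learn.
sklearn_z = StandardScaler().fit_transform(fares[['Fare']]).ravel()

# Both approaches agree up to floating-point rounding.
print(np.allclose(manual.to_numpy(), sklearn_z))
```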

from sklearn.preprocessing import StandardScaler

# fit the scaler on the training data only
ss = StandardScaler().fit(X_train[['Fare']])
X_train_copy = X_train.copy(deep=True)
# add the newly created feature
X_train_copy['Fare_zscore'] = ss.transform(X_train_copy[['Fare']])
print(X_train_copy.head(6))

Output:
     Survived  Pclass     Sex   Age  SibSp     Fare  Fare_zscore
857         1       1    male  51.0      0  26.5500    -0.122530
52          1       1  female  49.0      1  76.7292     0.918124
386         0       3    male   1.0      5  46.9000     0.299503
124         0       1    male  54.0      0  77.2875     0.929702
578         0       3  female   NaN      1  14.4583    -0.373297
549         1       2    male   8.0      1  36.7500     0.089005

Now we check the mean and standard deviation of the scaled feature. They should be approximately 0 and 1.

print(X_train_copy['Fare_zscore'].mean())
print(X_train_copy['Fare_zscore'].std())

Output:
5.916437306188636e-17
1.0008035356861
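
The standard deviation is slightly above 1 because pandas' Series.std() defaults to the sample standard deviation (ddof=1), whereas StandardScaler divides by the population standard deviation (ddof=0). The two differ by a factor of sqrt(n / (n - 1)). A small sketch on made-up values illustrates the effect:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up fares (illustrative only).
fares = pd.DataFrame({'Fare': [7.25, 71.28, 7.93, 53.10, 8.05, 26.55, 14.46, 36.75]})
z = pd.Series(StandardScaler().fit_transform(fares[['Fare']]).ravel())
n = len(z)

print(z.std())               # sample std (ddof=1): slightly above 1
print(z.std(ddof=0))         # population std: 1 up to rounding
print(np.sqrt(n / (n - 1)))  # the factor separating the two
```

With a larger n, as in the training set used in this article, the gap shrinks, which is why the printed value is so close to 1.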

Min-Max scaling

In Min-Max scaling, values are rescaled to the range [0, 1] using the formula below.

X_scaled = (X - X.min) / (X.max - X.min)
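
The article shows no code for this section. The sketch below follows the same pattern as the other sections, using scikit-learn's MinMaxScaler. The DataFrame is a made-up stand-in for X_train, and the column name Fare_minmax is chosen here for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up stand-in for X_train (not the Titanic data).
X_train = pd.DataFrame({'Fare': [7.25, 8.05, 13.0, 26.55, 35.5, 71.28, 263.0]})

# Fit on the training data, then add the scaled feature as a new column.
mms = MinMaxScaler().fit(X_train[['Fare']])
X_train_copy = X_train.copy(deep=True)
X_train_copy['Fare_minmax'] = mms.transform(X_train_copy[['Fare']]).ravel()
print(X_train_copy)

# On the data the scaler was fitted on, the scaled values span [0, 1].
print(X_train_copy['Fare_minmax'].min(), X_train_copy['Fare_minmax'].max())
```

Because the minimum and maximum define the range, a single extreme fare compresses all the other values toward 0.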

Robust scaling

In robust scaling, values are centered on the median and scaled by a quantile range (by default the interquartile range, IQR), which makes the result less sensitive to outliers.

X_scaled = (X - X.median) / IQR
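
As a check on the formula, the median and IQR can be computed by hand and compared with RobustScaler at its default settings. This is a sketch on made-up values, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Made-up fares standing in for X_train['Fare'].
fares = pd.DataFrame({'Fare': [7.25, 8.05, 13.0, 26.55, 35.5, 71.28, 263.0]})

# X_scaled = (X - median) / IQR, with IQR = 75th percentile - 25th percentile.
median = fares['Fare'].median()
iqr = fares['Fare'].quantile(0.75) - fares['Fare'].quantile(0.25)
manual = (fares['Fare'] - median) / iqr

# RobustScaler defaults to centering on the median and scaling by the 25-75 range.
sklearn_r = RobustScaler().fit_transform(fares[['Fare']]).ravel()

# Both approaches agree up to floating-point rounding.
print(np.allclose(manual.to_numpy(), sklearn_r))
```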

from sklearn.preprocessing import RobustScaler
rs = RobustScaler().fit(X_train[['Fare']])
X_train_copy = X_train.copy(deep=True)
X_train_copy['Fare_robust'] = rs.transform(X_train_copy[['Fare']])
print(X_train_copy.head(6))

Output:
     Survived  Pclass     Sex   Age  SibSp     Fare  Fare_robust
857         1       1    male  51.0      0  26.5500     0.492275
52          1       1  female  49.0      1  76.7292     2.630973
386         0       3    male   1.0      5  46.9000     1.359616
124         0       1    male  54.0      0  77.2875     2.654768
578         0       3  female   NaN      1  14.4583    -0.023088
549         1       2    male   8.0      1  36.7500     0.927011

Summary

This article introduced feature scaling using standardization (Z-score scaling), Min-Max scaling, and robust scaling.
