How to Detect and Handle Outliers

 

How to Detect and Handle Outliers – Feature Engineering and Feature Selection in Data Mining

In this article, I will discuss:

  • Detect outliers by an arbitrary boundary
  • Detect outliers using Interquartile Ranges Rule
  • Detect outliers using Mean and Standard Deviation Method
  • Imputation of outliers with an arbitrary value
  • Imputation of outliers with Mean, Median, Mode
  • How to Discard outliers

The dataset used in this article for demonstration is the titanic.csv file.

First, we import the required libraries: pandas, NumPy, os, and the outlier module from feature_cleaning.

import pandas as pd
import numpy as np
import os
from feature_cleaning import outlier as ot

Next, we use the read_csv() function from the pandas library to read the dataset. We are interested in only a few columns, so we list them in use_cols and pass that list to the usecols parameter.

use_cols = ['Pclass', 'Sex', 'Age', 'Fare', 'SibSp', 'Survived']

data = pd.read_csv('./data/titanic.csv', usecols=use_cols)

print(data.shape)

data.head(3)

We display the first three rows with data.head(3) to confirm that the dataset was read successfully. The shape attribute gives the dimensions of the dataset. In this case the shape is (891, 6), which means the dataset has 891 rows and 6 columns.

   Survived  Pclass     Sex   Age  SibSp     Fare
0         0       3    male  22.0      1   7.2500
1         1       1  female  38.0      1  71.2833
2         1       3  female  26.0      0   7.9250

Data set

Detect outliers by an arbitrary boundary

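Use the outlier_detect_arbitrary() function to find the outliers based on arbitrary boundaries. In the call below, any fare above the upper fence of 300 or below the lower fence of 5 is flagged as an outlier.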

index,para = ot.outlier_detect_arbitrary(data=data,col='Fare',upper_fence=300,lower_fence=5)
print('Upper bound:',para[0],'\nLower bound:',para[1])

Num of outlier detected: 19
The proportion of outliers detected was 0.02132435465768799
Upper bound: 300
Lower bound: 5

Use the following expression to display the detected outliers, sorted by fare.

data.loc[index,'Fare'].sort_values()

Output:
179      0.0000
806      0.0000
732      0.0000
674      0.0000
633      0.0000
597      0.0000
815      0.0000
466      0.0000
481      0.0000
302      0.0000
277      0.0000
271      0.0000
263      0.0000
413      0.0000
822      0.0000
378      4.0125
679    512.3292
737    512.3292
258    512.3292
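
For readers who do not have the feature_cleaning module at hand, the same check can be sketched with plain pandas. This is a minimal sketch, not the module's actual implementation: the helper name detect_outliers_arbitrary, the toy fares, and the use of strict comparisons are assumptions made for illustration.

```python
import pandas as pd


def detect_outliers_arbitrary(df, col, upper_fence, lower_fence):
    """Return index labels of rows whose value is above upper_fence or below lower_fence.

    Hypothetical helper; strict comparisons are an assumption.
    """
    mask = (df[col] > upper_fence) | (df[col] < lower_fence)
    return df[mask].index


# Small made-up sample, not the Titanic data.
toy = pd.DataFrame({"Fare": [7.25, 0.0, 71.28, 512.33, 4.01, 26.0]})

outlier_idx = detect_outliers_arbitrary(toy, "Fare", upper_fence=300, lower_fence=5)

# Show the flagged fares, sorted as in the article.
print(toy.loc[outlier_idx, "Fare"].sort_values())
```

The returned index can be passed to loc in the same way as the index returned by the library call above.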

Detect outliers using Interquartile Ranges Rule

In the Interquartile Range (IQR) rule, we first find the first quartile Q1 and the third quartile Q3. The IQR is the difference between them, IQR = Q3 - Q1. The lower boundary is then Q1 - threshold × IQR and the upper boundary is Q3 + threshold × IQR. A threshold of 1.5 is the common choice; the call below uses a threshold of 5. Any value that falls below the lower boundary or above the upper boundary is said to be an outlier.

[Figure: Interquartile Ranges Rule]

index,para = ot.outlier_detect_IQR(data=data,col='Fare',threshold=5)
print('Upper bound:',para[0],'\nLower bound:',para[1])

Num of outlier detected: 31
The proportion of outliers detected was 0.03479236812570146
Upper bound: 146.448
Lower bound: -107.53760000000001

Use the following expression to display the detected outliers, sorted by fare.

data.loc[index,'Fare'].sort_values()

Output:
31     146.5208
195    146.5208
305    151.5500
708    151.5500
297    151.5500
498    151.5500
609    153.4625
332    153.4625
268    153.4625
318    164.8667
856    164.8667
730    211.3375
779    211.3375
689    211.3375
377    211.5000
527    221.7792
700    227.5250
716    227.5250
557    227.5250
380    227.5250
299    247.5208
118    247.5208
311    262.3750
742    262.3750
341    263.0000
88     263.0000
438    263.0000
27     263.0000
679    512.3292
258    512.3292
737    512.3292
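
The boundaries printed above are consistent with Q3 + 5 × IQR and Q1 - 5 × IQR. The computation can be sketched in plain pandas as follows. The helper name iqr_fences and the toy values are assumptions for illustration, and the sketch is not taken from the feature_cleaning source.

```python
import pandas as pd


def iqr_fences(series, threshold=1.5):
    """Return (lower, upper) boundaries from the IQR rule.

    Hypothetical helper: lower = Q1 - threshold * IQR, upper = Q3 + threshold * IQR.
    """
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - threshold * iqr, q3 + threshold * iqr


# Made-up values with one obvious extreme point.
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])

lower, upper = iqr_fences(toy, threshold=1.5)
outliers = toy[(toy < lower) | (toy > upper)]

print("Lower bound:", lower, "Upper bound:", upper)
print(outliers)
```

A larger threshold, such as the 5 used in this article, widens the boundaries and flags fewer points.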

Detect outliers using Mean and Standard Deviation Method

index,para = ot.outlier_detect_mean_std(data=data,col='Fare',threshold=3)
print('Upper bound:',para[0],'\nLower bound:',para[1])

Num of outlier detected: 20
The proportion of outliers detected was 0.02244668911335578
Upper bound: 181.2844937601173
Lower bound: -116.87607782296811

Use the following expression to display the detected outliers, sorted by fare.

data.loc[index,'Fare'].sort_values()

Output:
779    211.3375
730    211.3375
689    211.3375
377    211.5000
527    221.7792
716    227.5250
700    227.5250
380    227.5250
557    227.5250
118    247.5208
299    247.5208
311    262.3750
742    262.3750
27     263.0000
341    263.0000
88     263.0000
438    263.0000
258    512.3292
737    512.3292
679    512.3292
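
In this method, a value is treated as an outlier when it lies more than threshold standard deviations away from the mean. A minimal pandas sketch of that idea is given here. The helper name mean_std_fences and the toy values are assumptions for illustration, and the sketch uses the pandas default sample standard deviation (ddof=1).

```python
import pandas as pd


def mean_std_fences(series, threshold=3):
    """Return (lower, upper) boundaries as mean -/+ threshold * standard deviation.

    Hypothetical helper; uses the pandas default sample standard deviation (ddof=1).
    """
    mean = series.mean()
    std = series.std()
    return mean - threshold * std, mean + threshold * std


# Made-up values: twenty identical fares and one extreme fare.
toy = pd.Series([10.0] * 20 + [100.0])

lower, upper = mean_std_fences(toy, threshold=3)
outliers = toy[(toy < lower) | (toy > upper)]

print("Lower bound:", lower, "Upper bound:", upper)
print(outliers)
```

Because the mean and the standard deviation are themselves pulled by extreme values, this method tends to flag fewer points than a tight IQR rule on skewed data such as Fare.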

Imputation of outliers with an arbitrary value

First, we need to find the outliers using any of the methods discussed above. Once the outliers are detected, they can be handled in different ways. Here we find the outliers using the arbitrary boundary method and display rows 261 to 272. In the Fare column, rows 263 and 271 contain outliers.

index,para = ot.outlier_detect_arbitrary(data=data,col='Fare',upper_fence=300,lower_fence=5)
data[261:273]

     Survived  Pclass     Sex   Age  SibSp      Fare
261         1       3    male   3.0      4   31.3875
262         0       1    male  52.0      1   79.6500
263         0       1    male  40.0      0    0.0000
264         0       3  female   NaN      0    7.7500
265         0       2    male  36.0      0   10.5000
266         0       3    male  16.0      4   39.6875
267         1       3    male  25.0      1    7.7750
268         1       1  female  58.0      0  153.4625
269         1       1  female  35.0      0  135.6333
270         0       1    male   NaN      0   31.0000
271         1       3    male  25.0      0    0.0000
272         1       2  female  41.0      0   19.5000

Now, we replace all outliers with an arbitrary value -999.

data2 = ot.impute_outlier_with_arbitrary(data=data,outlier_index=index,value=-999,col=['Fare'])
data2[261:273]

     Survived  Pclass     Sex   Age  SibSp      Fare
261         1       3    male   3.0      4   31.3875
262         0       1    male  52.0      1   79.6500
263         0       1    male  40.0      0      -999
264         0       3  female   NaN      0    7.7500
265         0       2    male  36.0      0   10.5000
266         0       3    male  16.0      4   39.6875
267         1       3    male  25.0      1    7.7750
268         1       1  female  58.0      0  153.4625
269         1       1  female  35.0      0  135.6333
270         0       1    male   NaN      0   31.0000
271         1       3    male  25.0      0      -999
272         1       2  female  41.0      0   19.5000
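
The replacement step itself is a single assignment through loc. The sketch below shows it in plain pandas on a copy of the data, so the original frame is left untouched. The helper name impute_with_value and the toy fares are assumptions for illustration, not the feature_cleaning implementation.

```python
import pandas as pd


def impute_with_value(df, outlier_index, col, value):
    """Return a copy of df with the rows in outlier_index set to value in column col.

    Hypothetical helper written for illustration.
    """
    out = df.copy()
    out.loc[outlier_index, col] = value
    return out


# Made-up fares; the third one falls below the lower fence of 5.
toy = pd.DataFrame({"Fare": [31.39, 79.65, 0.0, 7.75]})
mask = (toy["Fare"] > 300) | (toy["Fare"] < 5)
idx = toy[mask].index

imputed = impute_with_value(toy, idx, "Fare", -999)
print(imputed)
```

A sentinel such as -999 keeps the row while marking the value as replaced, but it distorts statistics such as the mean if it is not handled later.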

Imputation of outliers with Mean

data5 = ot.impute_outlier_with_avg(data=data,col='Fare',outlier_index=index,strategy='mean')
data5[261:273]

     Survived  Pclass     Sex   Age  SibSp       Fare
261         1       3    male   3.0      4    31.3875
262         0       1    male  52.0      1    79.6500
263         0       1    male  40.0      0  32.204208
264         0       3  female   NaN      0     7.7500
265         0       2    male  36.0      0    10.5000
266         0       3    male  16.0      4    39.6875
267         1       3    male  25.0      1     7.7750
268         1       1  female  58.0      0   153.4625
269         1       1  female  35.0      0   135.6333
270         0       1    male   NaN      0    31.0000
271         1       3    male  25.0      0  32.204208
272         1       2  female  41.0      0    19.5000

Imputation of outliers with Median

data5 = ot.impute_outlier_with_avg(data=data,col='Fare',outlier_index=index,strategy='median')
data5[261:273]

     Survived  Pclass     Sex   Age  SibSp      Fare
261         1       3    male   3.0      4   31.3875
262         0       1    male  52.0      1   79.6500
263         0       1    male  40.0      0   14.4542
264         0       3  female   NaN      0    7.7500
265         0       2    male  36.0      0   10.5000
266         0       3    male  16.0      4   39.6875
267         1       3    male  25.0      1    7.7750
268         1       1  female  58.0      0  153.4625
269         1       1  female  35.0      0  135.6333
270         0       1    male   NaN      0   31.0000
271         1       3    male  25.0      0   14.4542
272         1       2  female  41.0      0   19.5000

Imputation of outliers with Mode

data5 = ot.impute_outlier_with_avg(data=data,col='Fare',outlier_index=index,strategy='mode')
data5[261:273]

     Survived  Pclass     Sex   Age  SibSp      Fare
261         1       3    male   3.0      4   31.3875
262         0       1    male  52.0      1   79.6500
263         0       1    male  40.0      0    8.0500
264         0       3  female   NaN      0    7.7500
265         0       2    male  36.0      0   10.5000
266         0       3    male  16.0      4   39.6875
267         1       3    male  25.0      1    7.7750
268         1       1  female  58.0      0  153.4625
269         1       1  female  35.0      0  135.6333
270         0       1    male   NaN      0   31.0000
271         1       3    male  25.0      0    8.0500
272         1       2  female  41.0      0   19.5000
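
The three strategies differ only in the statistic used as the replacement value. A minimal pandas sketch is given below. The helper name impute_with_stat and the toy fares are assumptions for illustration, and the statistic is computed over the whole column, outliers included, which is an assumption about the intended behaviour. Unlike a silent fallback, the sketch raises an error for an unknown strategy name, which catches misspellings early.

```python
import pandas as pd


def impute_with_stat(df, outlier_index, col, strategy="mean"):
    """Return a copy of df with outlier rows in col replaced by the column mean, median, or mode.

    Hypothetical helper; the statistic is computed over the full column, outliers included.
    """
    out = df.copy()
    if strategy == "mean":
        fill = out[col].mean()
    elif strategy == "median":
        fill = out[col].median()
    elif strategy == "mode":
        # mode() returns a Series because ties are possible; take the first value.
        fill = out[col].mode()[0]
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    out.loc[outlier_index, col] = fill
    return out


# Made-up fares; the value 0.0 falls below the lower fence of 5.
toy = pd.DataFrame({"Fare": [8.0, 8.0, 13.0, 21.0, 0.0, 16.0]})
idx = toy[toy["Fare"] < 5].index

by_mean = impute_with_stat(toy, idx, "Fare", strategy="mean")
by_median = impute_with_stat(toy, idx, "Fare", strategy="median")
by_mode = impute_with_stat(toy, idx, "Fare", strategy="mode")

print(by_mean.loc[idx, "Fare"].tolist())
print(by_median.loc[idx, "Fare"].tolist())
print(by_mode.loc[idx, "Fare"].tolist())
```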

How to Discard outliers

Finally, we can delete the rows with the outliers using the drop_outlier() function.

data4 = ot.drop_outlier(data=data,outlier_index=index)
print(data4.shape)

Output:

(872, 6)

This shows that the 19 rows with outliers were removed, so 872 of the original 891 rows remain.
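
Dropping rows can also be sketched in plain pandas by filtering with the negated outlier mask. The toy fares and variable names below are assumptions for illustration, and the sketch is not the feature_cleaning implementation.

```python
import pandas as pd

# Made-up fares; 0.0 and 512.33 fall outside the fences of 5 and 300.
toy = pd.DataFrame({"Fare": [31.39, 0.0, 7.75, 512.33, 10.5]})

# True for rows that are outliers under the arbitrary boundaries.
mask = (toy["Fare"] > 300) | (toy["Fare"] < 5)

# Keep only the rows that are not outliers; the original frame is unchanged.
cleaned = toy[~mask]

print(toy.shape, "->", cleaned.shape)
```

The remaining rows keep their original index labels. Calling reset_index(drop=True) on the result renumbers them if a contiguous index is needed.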

Summary

This article showed how to detect outliers using an arbitrary boundary, the Interquartile Range rule, and the mean and standard deviation method, and how to handle them by imputing an arbitrary value, the mean, the median, or the mode, or by discarding the affected rows.