# How to Detect and Handle Outliers

## How to Detect and Handle Outliers – Feature Engineering and Feature Selection in Data Mining

• Detect outliers by an arbitrary boundary
• Detect outliers using Interquartile Ranges Rule
• Detect outliers using Mean and Standard Deviation Method
• Imputation of outliers with an arbitrary value
• Imputation of outliers with Mean, Median, Mode

## Video Tutorial – How to Detect and Handle Outliers

First, we import the required libraries: pandas, NumPy, os, and the outlier module from feature_cleaning.

```import pandas as pd
import numpy as np
import os
from feature_cleaning import outlier as ot```

Next, we use the read_csv() function from the pandas library to read the dataset. Only a few columns are needed, so we list them in use_cols and load just those.

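```use_cols = ['Pclass', 'Sex', 'Age', 'Fare', 'SibSp', 'Survived']

# The file path is not given in the original text; this one is a placeholder.
data = pd.read_csv('path/to/dataset.csv', usecols=use_cols)
print(data.shape)
data.head(3)```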

Now we display the first three rows with data.head(3) to confirm that the dataset was read successfully. The shape attribute gives the size of the dataset. In this case it is (891, 6), which means the dataset has 891 rows and 6 columns.

### Detect outliers by an arbitrary boundary

Use the outlier_detect_arbitrary() function to find outliers based on arbitrary boundaries. Here, any Fare above 300 or below 5 is flagged.

```index,para = ot.outlier_detect_arbitrary(data=data,col='Fare',upper_fence=300,lower_fence=5)
print('Upper bound:',para[0],'\nLower bound:',para[1])```

Num of outlier detected: 19
The proportion of outliers detected was 0.02132435465768799
Upper bound: 300
Lower bound: 5

The following expression displays the detected outliers, sorted by fare.

```data.loc[index,'Fare'].sort_values()

Output:
179      0.0000
806      0.0000
732      0.0000
674      0.0000
633      0.0000
597      0.0000
815      0.0000
466      0.0000
481      0.0000
302      0.0000
277      0.0000
271      0.0000
263      0.0000
413      0.0000
822      0.0000
378      4.0125
679    512.3292
737    512.3292
258    512.3292```
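To make the boundary check concrete, here is a minimal sketch of the same idea in plain pandas. The helper name `detect_arbitrary` and the toy fares are hypothetical, and the strict comparisons are an assumption; this is not the feature_cleaning implementation.

```python
import pandas as pd


def detect_arbitrary(values: pd.Series, upper_fence: float, lower_fence: float) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside the fences."""
    # Assumption: strict comparisons, so a value equal to a fence is not flagged.
    return (values > upper_fence) | (values < lower_fence)


# Toy fares for illustration only (not the tutorial's dataset)
fares = pd.Series([0.0, 4.0125, 7.25, 71.2833, 300.0, 512.3292], name="Fare")
mask = detect_arbitrary(fares, upper_fence=300, lower_fence=5)

print("Num of outliers detected:", int(mask.sum()))
print("Proportion of outliers:", mask.mean())
print(fares[mask].sort_values())
```

A boolean mask is used here because it can be passed directly to `.loc`, as in `fares[mask]`.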

### Detect outliers using Interquartile Ranges Rule

In the Interquartile Range (IQR) rule, we first find the first quartile Q1 and the third quartile Q3. The IQR is the difference Q3 - Q1. The lower boundary is then Q1 - threshold × IQR and the upper boundary is Q3 + threshold × IQR. Any value below the lower boundary or above the upper boundary is treated as an outlier. The call below uses threshold=5, which gives much wider fences than the common choice of 1.5.

```index,para = ot.outlier_detect_IQR(data=data,col='Fare',threshold=5)
print('Upper bound:',para[0],'\nLower bound:',para[1])```

Num of outlier detected: 31
The proportion of outliers detected was 0.03479236812570146
Upper bound: 146.448
Lower bound: -107.53760000000001

The following expression displays the detected outliers, sorted by fare.

```data.loc[index,'Fare'].sort_values()

Output:
31     146.5208
195    146.5208
305    151.5500
708    151.5500
297    151.5500
498    151.5500
609    153.4625
332    153.4625
268    153.4625
318    164.8667
856    164.8667
730    211.3375
779    211.3375
689    211.3375
377    211.5000
527    221.7792
700    227.5250
716    227.5250
557    227.5250
380    227.5250
299    247.5208
118    247.5208
311    262.3750
742    262.3750
341    263.0000
88     263.0000
438    263.0000
27     263.0000
679    512.3292
258    512.3292
737    512.3292```
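The fence arithmetic of the IQR rule can be sketched in a few lines of pandas. The helper `iqr_fences` and the toy values are hypothetical, and the sketch assumes pandas' default linear interpolation for quantiles, which may differ from what the library uses.

```python
import pandas as pd


def iqr_fences(values: pd.Series, threshold: float = 1.5):
    """Return (upper, lower) fences: Q3 + threshold*IQR and Q1 - threshold*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q3 + threshold * iqr, q1 - threshold * iqr


# Toy values for illustration only: Q1 = 3, Q3 = 7, so IQR = 4
fares = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100], dtype=float)
upper, lower = iqr_fences(fares, threshold=1.5)

# Flag everything strictly outside the fences
mask = (fares > upper) | (fares < lower)

print("Upper bound:", upper)
print("Lower bound:", lower)
print(fares[mask])
```

A larger threshold widens both fences, so fewer points are flagged.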

### Detect outliers using Mean and Standard Deviation Method

```index,para = ot.outlier_detect_mean_std(data=data,col='Fare',threshold=3)
print('Upper bound:',para[0],'\nLower bound:',para[1])```

Num of outlier detected: 20
The proportion of outliers detected was 0.02244668911335578
Upper bound: 181.2844937601173
Lower bound: -116.87607782296811

The following expression displays the detected outliers, sorted by fare.

```data.loc[index,'Fare'].sort_values()

Output:
779    211.3375
730    211.3375
689    211.3375
377    211.5000
527    221.7792
716    227.5250
700    227.5250
380    227.5250
557    227.5250
118    247.5208
299    247.5208
311    262.3750
742    262.3750
27     263.0000
341    263.0000
88     263.0000
438    263.0000
258    512.3292
737    512.3292
679    512.3292```
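The bounds reported above are symmetric around a central value, which is what a rule of the form mean ± threshold × standard deviation produces. The sketch below illustrates that rule under two assumptions: the bounds really are mean ± threshold × std, and the standard deviation is pandas' default sample standard deviation (ddof=1). The helper name `mean_std_fences` and the toy data are hypothetical.

```python
import pandas as pd


def mean_std_fences(values: pd.Series, threshold: float = 3.0):
    """Return (upper, lower) fences at mean +/- threshold * std."""
    mean = values.mean()
    std = values.std()  # pandas default: sample standard deviation (ddof=1)
    return mean + threshold * std, mean - threshold * std


# Toy values for illustration only: nineteen ordinary fares and one extreme fare
fares = pd.Series([10.0] * 19 + [1000.0])
upper, lower = mean_std_fences(fares, threshold=3)

# Flag everything strictly outside the fences
mask = (fares > upper) | (fares < lower)

print("Upper bound:", upper)
print("Lower bound:", lower)
print("Num of outliers detected:", int(mask.sum()))
```

Because the mean and standard deviation are themselves pulled by extreme values, this rule tends to be less sensitive than the IQR rule on heavily skewed columns.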

### Imputation of outliers with an arbitrary value

First we need to find the outliers using any of the methods discussed above. Once detected, the outliers can be handled in several ways. Here we detect them with the arbitrary boundary method and display rows 261 to 272. In the Fare column, rows 263 and 271 contain outliers.

```index,para = ot.outlier_detect_arbitrary(data=data,col='Fare',upper_fence=300,lower_fence=5)
data[261:273]```

Now, we replace all outliers with an arbitrary value -999.

```data2 = ot.impute_outlier_with_arbitrary(data=data,outlier_index=index,value=-999,col=['Fare'])
data2[261:273]```
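As a rough illustration of what this kind of imputation does, the sketch below overwrites flagged cells on a copy of the frame. The helper `impute_with_value` and the toy frame are hypothetical, and copying before assignment is a design choice made for this sketch.

```python
import pandas as pd


def impute_with_value(data: pd.DataFrame, mask: pd.Series, value, cols) -> pd.DataFrame:
    """Return a copy of data with the flagged rows of each column in cols set to value."""
    out = data.copy(deep=True)  # leave the caller's frame untouched
    for col in cols:
        out.loc[mask, col] = value
    return out


# Toy frame for illustration only
data = pd.DataFrame({"Fare": [7.25, 0.0, 512.3292, 26.0],
                     "Age": [22.0, 40.0, 35.0, 54.0]})
mask = (data["Fare"] > 300) | (data["Fare"] < 5)

imputed = impute_with_value(data, mask, value=-999, cols=["Fare"])
print(imputed)
```

A sentinel such as -999 keeps the rows but makes the replaced cells easy to spot. It also distorts the column's mean and variance, so it suits models that can isolate the sentinel better than models that assume a smooth distribution.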

### Imputation of outliers with Mean

```data5 = ot.impute_outlier_with_avg(data=data,col='Fare',outlier_index=index,strategy='mean')
data5[261:273]```

### Imputation of outliers with Median

```data5 = ot.impute_outlier_with_avg(data=data,col='Fare',outlier_index=index,strategy='median')
data5[261:273]```

### Imputation of outliers with Mode

```data5 = ot.impute_outlier_with_avg(data=data,col='Fare',outlier_index=index,strategy='mode')
data5[261:273]```
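The three strategies differ only in the statistic used as the replacement value. The sketch below shows this on a toy column. The helper `impute_with_stat` is hypothetical, and it assumes the statistic is computed over the whole column, outliers included; the library may compute it differently.

```python
import pandas as pd


def impute_with_stat(data: pd.DataFrame, col: str, mask: pd.Series,
                     strategy: str = "mean") -> pd.DataFrame:
    """Return a copy of data with flagged values in col replaced by a column statistic."""
    out = data.copy(deep=True)
    if strategy == "mean":
        fill = out[col].mean()
    elif strategy == "median":
        fill = out[col].median()
    elif strategy == "mode":
        fill = out[col].mode().iloc[0]  # first mode if there are ties
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    out.loc[mask, col] = fill
    return out


# Toy frame for illustration only: mean = 108, median = 10, mode = 8
data = pd.DataFrame({"Fare": [8.0, 8.0, 10.0, 14.0, 500.0]})
mask = data["Fare"] > 300

for strategy in ("mean", "median", "mode"):
    result = impute_with_stat(data, "Fare", mask, strategy=strategy)
    print(strategy, result["Fare"].tolist())
```

In this toy column the mean is dragged upward by the outlier it is meant to replace, while the median and mode are not. This sketch also raises a ValueError for an unrecognized strategy name instead of silently leaving the data unchanged.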

Finally, we can delete the rows with the outliers using the drop_outlier() function.

```data4 = ot.drop_outlier(data=data,outlier_index=index)
print(data4.shape)```

Output:

`(872, 6)`

This shows that the 19 rows containing outliers were removed, leaving 872 of the original 891 rows.
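Dropping rows can also be sketched with plain boolean indexing. The toy frame below is hypothetical and only illustrates the row arithmetic: the number of rows removed equals the number of flagged rows.

```python
import pandas as pd

# Toy frame for illustration only
data = pd.DataFrame({"Fare": [7.25, 0.0, 512.3292, 26.0, 13.0]})
mask = (data["Fare"] > 300) | (data["Fare"] < 5)

# Keep only the rows that are not flagged; the original index labels are preserved
kept = data.loc[~mask]

print(data.shape, "->", kept.shape)
print("Rows removed:", len(data) - len(kept))
```

If later code relies on a contiguous index, calling `kept.reset_index(drop=True)` renumbers the remaining rows.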