Missing Values in Data Mining

How to Handle Missing Values – Feature Engineering and Feature Selection in Data Mining

In this article, I will discuss:

How to check for missing values in a given dataset
Listwise deletion – deleting the rows with missing values
Adding a variable to denote NA
Arbitrary Value Imputation
Mean/Median/Mode Imputation
Random Imputation

The titanic.csv dataset is used in this article for demonstration.

First, we will import the required libraries.

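import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import os
# On matplotlib 3.6 and later this style is named 'seaborn-v0_8-colorblind'
plt.style.use('seaborn-colorblind')
%matplotlib inline
from data_exploration import explore
# ms is the missing-data helper module used in the rest of the article;
# this import path is an assumption, adjust it to where the module lives
from feature_cleaning import missing_data as ms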

Next, we use the read_csv() function from the pandas library to read the dataset. We are interested in only a few columns, so we list them in use_cols and pass that list to the usecols parameter.

use_cols = ['Pclass', 'Sex', 'Age', 'Fare', 'SibSp', 'Survived']

data = pd.read_csv('./data/titanic.csv', usecols=use_cols)

print(data.shape)

data.head(8)

The shape attribute gives the dimensions of the dataset. Here it is (891, 6), which means the dataset has 891 rows and 6 columns. We then display the first eight rows with data.head(8) to confirm that the dataset was read successfully.

	Survived	Pclass	Sex	Age	SibSp	Fare
0	0	3	male	22.0	1	7.2500
1	1	1	female	38.0	1	71.2833
2	1	3	female	26.0	0	7.9250
3	1	1	female	35.0	1	53.1000
4	0	3	male	35.0	0	8.0500
5	0	3	male	NaN	0	8.4583
6	0	1	male	54.0	0	51.8625
7	0	3	male	2.0	3	21.0750

Data set

Missing value checking

The check_missing() function from the ms module reports the total number and the proportion of missing values for each variable of a pandas DataFrame.

# only the Age variable has missing values: 177 cases in total
# the result is saved to the output directory (if one is given)

ms.check_missing(data=data,output_path=r'./output/')

	total missing	proportion
Survived	0	0.000000
Pclass	0	0.000000
Sex	0	0.000000
Age	177	0.198653
SibSp	0	0.000000
Fare	0	0.000000

Number of missing values and their percentage
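
The same check can be approximated with plain pandas when the helper module is not available. The sketch below is only an illustration: the missing_summary() function and the small toy DataFrame are hypothetical and are not part of the ms module or the Titanic data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic columns (values are made up for illustration)
toy = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

def missing_summary(df):
    """Hypothetical helper: total and proportion of missing values per column."""
    total = df.isnull().sum()        # count of NaN per column
    proportion = df.isnull().mean()  # fraction of NaN per column
    return pd.DataFrame({"total missing": total, "proportion": proportion})

summary = missing_summary(toy)
print(summary)
```

Because isnull() returns booleans, taking the column mean gives the proportion of missing entries directly.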

Listwise deletion

drop_missing() deletes every example (row) that contains a missing value, which is known as listwise deletion. We then display the shape of the dataset after the deletion. Dropping the 177 rows with a missing Age from the original 891 rows leaves a dataset of shape (714, 6).

# the 177 cases that contain NA have been dropped
data2 = ms.drop_missing(data=data)
data2.shape
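
For comparison, listwise deletion can be sketched with the built-in pandas dropna() method. This is a minimal example on a made-up toy DataFrame, not the implementation of drop_missing().

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; two of the four rows have a missing Age
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Drop every row that has at least one missing value (listwise deletion)
complete = toy.dropna(axis=0, how="any")

print(toy.shape, "->", complete.shape)
```

Note that dropna() returns a new DataFrame and leaves the original unchanged unless inplace=True is passed.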

Add a variable to denote NA

The add_var_denote_NA() function creates an additional variable that indicates whether the value was missing for each observation.

# Age_is_NA is created, 0-not missing 1-missing for that observation
data3 = ms.add_var_denote_NA(data=data,NA_col=['Age'])
print(data3.Age_is_NA.value_counts())
data3.head(8)

In the new Age_is_NA column, observations with a missing Age are marked 1 and all other observations are marked 0.

	Survived	Pclass	Sex	Age	SibSp	Fare	Age_is_NA
0	0	3	male	22.0	1	7.2500	0
1	1	1	female	38.0	1	71.2833	0
2	1	3	female	26.0	0	7.9250	0
3	1	1	female	35.0	1	53.1000	0
4	0	3	male	35.0	0	8.0500	0
5	0	3	male	NaN	0	8.4583	1
6	0	1	male	54.0	0	51.8625	0
7	0	3	male	2.0	3	21.0750	0
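
A missing-value indicator like Age_is_NA can also be built directly with pandas. The sketch below uses a made-up toy DataFrame and only mirrors the column naming shown above; it is not the code of add_var_denote_NA().

```python
import numpy as np
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({"Age": [22.0, np.nan, 26.0, np.nan]})

# Work on a copy so the original frame is left untouched
flagged = toy.copy()

# isnull() gives True/False; astype(int) turns that into 1 (missing) / 0 (present)
flagged["Age_is_NA"] = flagged["Age"].isnull().astype(int)

print(flagged["Age_is_NA"].value_counts())
```

The indicator lets a model learn from the fact that a value was missing even after the original column is imputed.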

Arbitrary Value Imputation

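Arbitrary Value Imputation replaces the missing values (represented by NA) with an arbitrary value, usually one that lies outside the normal range of the variable so that imputed entries remain recognizable. The impute_NA_with_arbitrary() function performs this replacement. Here NA is replaced with -999, and the result is stored in a new column, Age_-999.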

data4 = ms.impute_NA_with_arbitrary(data=data,impute_value=-999,NA_col=['Age'])
data4.head(8)

	Survived	Pclass	Sex	Age	SibSp	Fare	Age_-999
0	0	3	male	22.0	1	7.2500	22.0
1	1	1	female	38.0	1	71.2833	38.0
2	1	3	female	26.0	0	7.9250	26.0
3	1	1	female	35.0	1	53.1000	35.0
4	0	3	male	35.0	0	8.0500	35.0
5	0	3	male	NaN	0	8.4583	-999.0
6	0	1	male	54.0	0	51.8625	54.0
7	0	3	male	2.0	3	21.0750	2.0
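
As a rough equivalent, arbitrary value imputation can be written with the pandas fillna() method. The toy DataFrame is made up for illustration, and the new column name simply follows the Age_-999 pattern seen in the output above.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({"Age": [22.0, np.nan, 26.0]})

impute_value = -999  # arbitrary value chosen to lie outside the plausible age range

# Keep the original column and store the imputed version in a new one
imputed = toy.copy()
imputed["Age_" + str(impute_value)] = imputed["Age"].fillna(impute_value)

print(imputed)
```

Keeping the original Age column alongside the imputed one makes it easy to compare both versions later.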

Mean / Median / Mode Imputation

Missing values (NA) are replaced with the mean, median, or mode of that column. The impute_NA_with_avg() function performs this imputation; the statistic is selected by setting the strategy parameter to 'mean', 'median', or 'mode' respectively.

print(data.Age.mean())  # Mean is 29.69911764705882
data5 = ms.impute_NA_with_avg(data=data,strategy='mean',NA_col=['Age'])
data5.head(8)

	Survived	Pclass	Sex	Age	SibSp	Fare	Age_impute_mean
0	0	3	male	22.0	1	7.2500	22.000000
1	1	1	female	38.0	1	71.2833	38.000000
2	1	3	female	26.0	0	7.9250	26.000000
3	1	1	female	35.0	1	53.1000	35.000000
4	0	3	male	35.0	0	8.0500	35.000000
5	0	3	male	NaN	0	8.4583	29.699118
6	0	1	male	54.0	0	51.8625	54.000000
7	0	3	male	2.0	3	21.0750	2.000000

print(data.Age.median())  # Median is 28.0
data5 = ms.impute_NA_with_avg(data=data,strategy='median',NA_col=['Age'])
data5.head(8)

	Survived	Pclass	Sex	Age	SibSp	Fare	Age_impute_median
0	0	3	male	22.0	1	7.2500	22.000000
1	1	1	female	38.0	1	71.2833	38.000000
2	1	3	female	26.0	0	7.9250	26.000000
3	1	1	female	35.0	1	53.1000	35.000000
4	0	3	male	35.0	0	8.0500	35.000000
5	0	3	male	NaN	0	8.4583	28.000000
6	0	1	male	54.0	0	51.8625	54.000000
7	0	3	male	2.0	3	21.0750	2.000000

print(data.Age.mode()[0])  # Mode is 24
data5 = ms.impute_NA_with_avg(data=data,strategy='mode',NA_col=['Age'])
data5.head(8)

	Survived	Pclass	Sex	Age	SibSp	Fare	Age_impute_mode
0	0	3	male	22.0	1	7.2500	22.000000
1	1	1	female	38.0	1	71.2833	38.000000
2	1	3	female	26.0	0	7.9250	26.000000
3	1	1	female	35.0	1	53.1000	35.000000
4	0	3	male	35.0	0	8.0500	35.000000
5	0	3	male	NaN	0	8.4583	24.000000
6	0	1	male	54.0	0	51.8625	54.000000
7	0	3	male	2.0	3	21.0750	2.000000
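
The three strategies can be sketched in plain pandas as follows. The impute_with_average() function and the toy data are hypothetical and only illustrate the idea; they are not the source of impute_NA_with_avg().

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; observed ages are 20, 20, 30 and 50
toy = pd.DataFrame({"Age": [20.0, 20.0, 30.0, np.nan, 50.0]})

def impute_with_average(df, col, strategy="mean"):
    """Hypothetical helper: fill NaN in `col` with its mean, median, or mode."""
    if strategy == "mean":
        value = df[col].mean()
    elif strategy == "median":
        value = df[col].median()
    elif strategy == "mode":
        # mode() returns a Series because ties are possible; take the first entry
        value = df[col].mode()[0]
    else:
        raise ValueError("strategy must be 'mean', 'median', or 'mode'")
    out = df.copy()
    out[col + "_impute_" + strategy] = out[col].fillna(value)
    return out

out_mean = impute_with_average(toy, "Age", "mean")
out_median = impute_with_average(toy, "Age", "median")
out_mode = impute_with_average(toy, "Age", "mode")
print(out_mean, out_median, out_mode, sep="\n\n")
```

The mean is sensitive to outliers, so the median is often the safer choice for skewed variables, while the mode is the usual option for categorical ones.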

Random Imputation

Here each missing value is replaced with a value drawn at random from the pool of available (non-missing) observations of the same variable. The impute_NA_with_random() function performs this replacement.

data7 = ms.impute_NA_with_random(data=data,NA_col=['Age'])
data7.head(8)

	Survived	Pclass	Sex	Age	SibSp	Fare	Age_random
0	0	3	male	22.0	1	7.2500	22.000000
1	1	1	female	38.0	1	71.2833	38.000000
2	1	3	female	26.0	0	7.9250	26.000000
3	1	1	female	35.0	1	53.1000	35.000000
4	0	3	male	35.0	0	8.0500	35.000000
5	0	3	male	NaN	0	8.4583	28.000000
6	0	1	male	54.0	0	51.8625	54.000000
7	0	3	male	2.0	3	21.0750	2.000000
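
Random imputation can be sketched with the pandas sample() method. The impute_with_random() function, its random_state parameter, the new column name, and the toy data are all assumptions made for illustration and may differ from what impute_NA_with_random() does internally.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({"Age": [22.0, np.nan, 26.0, 35.0, np.nan]})

def impute_with_random(df, col, random_state=0):
    """Hypothetical helper: fill NaN in `col` with values sampled from the observed ones."""
    out = df.copy()
    missing = out[col].isnull()
    # Draw one observed value per missing entry; replace=True keeps this valid
    # even when there are more missing entries than observed values
    sample = out[col].dropna().sample(
        n=int(missing.sum()), replace=True, random_state=random_state
    )
    # Re-index the sample so it lines up with the rows that are missing
    sample.index = out.index[missing]
    out[col + "_random"] = out[col]
    out.loc[missing, col + "_random"] = sample
    return out

randomly_imputed = impute_with_random(toy, "Age")
print(randomly_imputed)
```

Fixing random_state makes the imputation reproducible, and sampling from the observed values preserves the original distribution of the variable better than a single constant does.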

Summary

This article covered how to handle missing values as part of feature engineering and feature selection in data mining: checking for missing values, listwise deletion, adding a variable to denote NA, and arbitrary value, mean/median/mode, and random imputation.
