Missing Values in Data Mining

 

How to Handle Missing Values – Feature Engineering and Feature Selection in Data Mining

In this article, I will discuss:

  • How to check for missing values in a given dataset
  • Listwise deletion – deleting the rows that contain missing values
  • Adding a variable to denote NA
  • Arbitrary Value Imputation
  • Mean/Median/Mode Imputation
  • Random Imputation


This article uses the Titanic dataset (the titanic.csv file) for demonstration.

First, we will import the required libraries.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import os
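# On Matplotlib 3.6 and later this style is named 'seaborn-v0_8-colorblind'
plt.style.use('seaborn-colorblind')
# Jupyter/IPython magic; remove this line when running as a plain Python script
%matplotlib inline
from data_exploration import explore
# Note: later snippets call the missing-value helper functions via the alias ms,
# so the module that provides them must also be imported under that name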

Next, we use the read_csv() function from the pandas library to read the dataset. We are interested in only a few columns, so the required column names are collected in the list use_cols and passed to the usecols parameter.

use_cols = ['Pclass', 'Sex', 'Age', 'Fare', 'SibSp', 'Survived']

data = pd.read_csv('./data/titanic.csv', usecols=use_cols)

print(data.shape)

data.head(8)

The shape attribute gives the dimensions of the dataset. Here it is (891, 6), which means the dataset has 891 rows and 6 columns. The data.head(8) call displays the first eight rows, which confirms that the dataset was read successfully.

   Survived  Pclass     Sex   Age  SibSp     Fare
0         0       3    male  22.0      1   7.2500
1         1       1  female  38.0      1  71.2833
2         1       3  female  26.0      0   7.9250
3         1       1  female  35.0      1  53.1000
4         0       3    male  35.0      0   8.0500
5         0       3    male   NaN      0   8.4583
6         0       1    male  54.0      0  51.8625
7         0       3    male   2.0      3  21.0750

Data set
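The effect of usecols can be sketched on a tiny in-memory CSV. The three rows below are made up for illustration (they are not taken from titanic.csv), and io.StringIO stands in for the file on disk. Note that read_csv keeps the column order of the file rather than the order of the list, which is why the table above starts with Survived.

```python
import io

import pandas as pd

# Made-up miniature CSV with two extra columns (PassengerId, Name)
# that we do not want to load.
csv_text = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Fare
1,0,3,Passenger A,male,22,1,7.25
2,1,1,Passenger B,female,38,1,71.2833
3,0,3,Passenger C,male,,0,8.4583
"""

use_cols = ['Pclass', 'Sex', 'Age', 'Fare', 'SibSp', 'Survived']

# Only the listed columns are parsed; the empty Age field becomes NaN.
subset = pd.read_csv(io.StringIO(csv_text), usecols=use_cols)

print(subset.shape)
print(subset.columns.tolist())
```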

Missing value checking

The check_missing() function from the missing-value helper module (referenced as ms in the code) reports the total number of missing values and the proportion of missing values for each variable of a pandas DataFrame.

# only the variable Age has missing values: 177 cases in total
# the result is saved in the output dir (if given)

ms.check_missing(data=data,output_path=r'./output/')

          total missing  proportion
Survived              0    0.000000
Pclass                0    0.000000
Sex                   0    0.000000
Age                 177    0.198653
SibSp                 0    0.000000
Fare                  0    0.000000

Number of missing values and their proportion
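The same kind of report can be produced with plain pandas. The following is a minimal sketch, not the helper's actual implementation: missing_report is a hypothetical name, and the small frame is hand-built from the eight rows shown earlier so that the snippet runs without the CSV file. On the full dataset the Age proportion would be 177 / 891 ≈ 0.198653, matching the table above.

```python
import numpy as np
import pandas as pd

# Stand-in frame built from the eight rows displayed earlier in the article.
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0, 0, 0],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan, 54.0, 2.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583, 51.8625, 21.075],
})


def missing_report(df):
    """Return the count and proportion of missing values per column."""
    is_missing = df.isnull()
    return pd.DataFrame({
        "total missing": is_missing.sum(),   # number of NaN cells per column
        "proportion": is_missing.mean(),     # share of rows that are NaN
    })


report = missing_report(sample)
print(report)
```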

Listwise deletion

The drop_missing() function deletes every example (row) that contains a missing value, which is known as listwise deletion. Displaying the shape afterwards shows that 177 of the original 891 rows were removed, leaving a dataset of shape (714, 6).

# the 177 rows that contain NA are dropped
data2 = ms.drop_missing(data=data)
data2.shape
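Listwise deletion corresponds to DataFrame.dropna in plain pandas. The sketch below assumes a small hand-built frame (two columns from the eight rows shown earlier) instead of the full dataset.

```python
import numpy as np
import pandas as pd

# Stand-in frame: row 5 has a missing Age.
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male", "male", "male", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan, 54.0, 2.0],
})

# Drop every row that has at least one missing value.
# dropna returns a new frame; the original is left untouched.
complete_rows = sample.dropna(axis=0, how="any")

print(sample.shape, "->", complete_rows.shape)
```

Listwise deletion is simple, but it discards the other, non-missing values in every dropped row, so the cost grows with the share of incomplete rows.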

Add a variable to denote NA

The add_var_denote_NA() function creates an additional variable that indicates whether the value was missing for that observation.

# Age_is_NA is created, 0-not missing 1-missing for that observation
data3 = ms.add_var_denote_NA(data=data,NA_col=['Age'])
print(data3.Age_is_NA.value_counts())
data3.head(8)

In the new Age_is_NA column, observations with a missing Age are marked with 1 and all other observations with 0. The original Age column is left unchanged.

   Survived  Pclass     Sex   Age  SibSp     Fare  Age_is_NA
0         0       3    male  22.0      1   7.2500          0
1         1       1  female  38.0      1  71.2833          0
2         1       3  female  26.0      0   7.9250          0
3         1       1  female  35.0      1  53.1000          0
4         0       3    male  35.0      0   8.0500          0
5         0       3    male   NaN      0   8.4583          1
6         0       1    male  54.0      0  51.8625          0
7         0       3    male   2.0      3  21.0750          0
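A missing-value indicator like the one above can be built with isnull(). This is a minimal sketch on a hand-built frame. The column name Age_is_NA follows the output shown above, but the code is not the helper's own implementation.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan, 54.0, 2.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583, 51.8625, 21.075],
})

# Work on a copy so the original frame is not modified.
flagged = sample.copy()

# 1 where Age is missing, 0 where it is present.
flagged["Age_is_NA"] = flagged["Age"].isnull().astype(int)

print(flagged["Age_is_NA"].value_counts())
```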

Arbitrary Value Imputation

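Arbitrary Value Imputation replaces the missing values (represented by NA) with an arbitrary value, usually one that lies outside the normal range of the variable so that imputed entries remain recognizable. The impute_NA_with_arbitrary() function replaces NA with the given value and stores the result in a new column. Here NA is replaced with -999.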

data4 = ms.impute_NA_with_arbitrary(data=data,impute_value=-999,NA_col=['Age'])
data4.head(8)

   Survived  Pclass     Sex   Age  SibSp     Fare  Age_-999
0         0       3    male  22.0      1   7.2500      22.0
1         1       1  female  38.0      1  71.2833      38.0
2         1       3  female  26.0      0   7.9250      26.0
3         1       1  female  35.0      1  53.1000      35.0
4         0       3    male  35.0      0   8.0500      35.0
5         0       3    male   NaN      0   8.4583    -999.0
6         0       1    male  54.0      0  51.8625      54.0
7         0       3    male   2.0      3  21.0750       2.0
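In plain pandas, arbitrary value imputation comes down to a single fillna call. The sketch below uses a hand-built frame and writes the result to a new column named as in the output above. It is an illustration, not the helper's code.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan, 54.0, 2.0],
})

impute_value = -999  # arbitrary value, chosen outside the plausible age range

imputed = sample.copy()
# Keep the original Age column and add the imputed version beside it.
imputed["Age_" + str(impute_value)] = imputed["Age"].fillna(impute_value)

print(imputed)
```

Keeping the original column alongside the imputed one makes it easy to compare the two and to try a different strategy later.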

Mean / Median / Mode Imputation

Missing values (NA) are replaced with the mean, median, or mode of that column. The impute_NA_with_avg() function performs the imputation, and the statistic is chosen by setting strategy to 'mean', 'median', or 'mode' respectively.

print(data.Age.mean())
data5 = ms.impute_NA_with_avg(data=data,strategy='mean',NA_col=['Age'])
data5.head(8)
# Mean is 29.69911764705882

   Survived  Pclass     Sex   Age  SibSp     Fare  Age_impute_mean
0         0       3    male  22.0      1   7.2500        22.000000
1         1       1  female  38.0      1  71.2833        38.000000
2         1       3  female  26.0      0   7.9250        26.000000
3         1       1  female  35.0      1  53.1000        35.000000
4         0       3    male  35.0      0   8.0500        35.000000
5         0       3    male   NaN      0   8.4583        29.699118
6         0       1    male  54.0      0  51.8625        54.000000
7         0       3    male   2.0      3  21.0750         2.000000

print(data.Age.median())
data5 = ms.impute_NA_with_avg(data=data,strategy='median',NA_col=['Age'])
data5.head(8)
# Median is 28.0

   Survived  Pclass     Sex   Age  SibSp     Fare  Age_impute_median
0         0       3    male  22.0      1   7.2500          22.000000
1         1       1  female  38.0      1  71.2833          38.000000
2         1       3  female  26.0      0   7.9250          26.000000
3         1       1  female  35.0      1  53.1000          35.000000
4         0       3    male  35.0      0   8.0500          35.000000
5         0       3    male   NaN      0   8.4583          28.000000
6         0       1    male  54.0      0  51.8625          54.000000
7         0       3    male   2.0      3  21.0750           2.000000

# mode() returns a Series, so take its first entry
print(data.Age.mode()[0])
data5 = ms.impute_NA_with_avg(data=data,strategy='mode',NA_col=['Age'])
data5.head(8)
# Mode is 24

   Survived  Pclass     Sex   Age  SibSp     Fare  Age_impute_mode
0         0       3    male  22.0      1   7.2500        22.000000
1         1       1  female  38.0      1  71.2833        38.000000
2         1       3  female  26.0      0   7.9250        26.000000
3         1       1  female  35.0      1  53.1000        35.000000
4         0       3    male  35.0      0   8.0500        35.000000
5         0       3    male   NaN      0   8.4583        24.000000
6         0       1    male  54.0      0  51.8625        54.000000
7         0       3    male   2.0      3  21.0750         2.000000
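The three strategies can be sketched with plain pandas as follows. impute_with_average is a hypothetical helper written for this illustration, and the output column naming (column, "_impute_", strategy) is an assumption modelled on the Age_impute_mean column above. The frame holds only the eight ages shown earlier, so its statistics differ from those of the full dataset.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan, 54.0, 2.0],
})


def impute_with_average(df, column, strategy="mean"):
    """Return a copy of df with NaN in `column` filled by the chosen statistic."""
    if strategy == "mean":
        fill_value = df[column].mean()
    elif strategy == "median":
        fill_value = df[column].median()
    elif strategy == "mode":
        # mode() returns a Series (there can be ties); take the first entry.
        fill_value = df[column].mode()[0]
    else:
        raise ValueError("strategy must be 'mean', 'median' or 'mode'")

    out = df.copy()
    out[column + "_impute_" + strategy] = out[column].fillna(fill_value)
    return out


results = {s: impute_with_average(sample, "Age", s)
           for s in ("mean", "median", "mode")}

for strategy, frame in results.items():
    print(strategy, frame.loc[5, "Age_impute_" + strategy])
```

The mean is sensitive to outliers, so the median is often the safer choice for skewed numeric variables, while the mode is the usual option for categorical ones.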

Random Imputation

Here each missing value is replaced with a value drawn at random from the pool of available (non-missing) observations of the same variable. The impute_NA_with_random() function performs this replacement.

data7 = ms.impute_NA_with_random(data=data,NA_col=['Age'])
data7.head(8)

   Survived  Pclass     Sex   Age  SibSp     Fare  Age_impute_mean
0         0       3    male  22.0      1   7.2500        22.000000
1         1       1  female  38.0      1  71.2833        38.000000
2         1       3  female  26.0      0   7.9250        26.000000
3         1       1  female  35.0      1  53.1000        35.000000
4         0       3    male  35.0      0   8.0500        35.000000
5         0       3    male   NaN      0   8.4583        28.000000
6         0       1    male  54.0      0  51.8625        54.000000
7         0       3    male   2.0      3  21.0750         2.000000
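Random imputation can be sketched with Series.sample, drawing replacement values from the observed entries of the same column. impute_with_random, the Age_random column name, and the random_state parameter are assumptions made for this illustration and are not taken from the helper module.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan, 54.0, 2.0],
})


def impute_with_random(df, column, random_state=0):
    """Fill NaN in `column` with values sampled from its observed entries."""
    out = df.copy()
    new_col = column + "_random"
    out[new_col] = out[column]

    missing_mask = out[column].isnull()
    n_missing = int(missing_mask.sum())

    # Draw one observed value per missing entry; sampling with replacement
    # also works when there are more gaps than observed values.
    draws = out[column].dropna().sample(
        n=n_missing, replace=True, random_state=random_state
    )
    # Re-label the draws with the positions of the gaps so they align on assignment.
    draws.index = out.index[missing_mask.to_numpy()]
    out.loc[missing_mask, new_col] = draws
    return out


rand_df = impute_with_random(sample, "Age")
print(rand_df)
```

Because the replacements come from the observed values, the distribution of the variable is roughly preserved. Fixing random_state keeps the result reproducible between runs.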

Summary

This article introduced how to handle missing values as part of feature engineering and feature selection in data mining: checking for missing values, listwise deletion, adding a variable to denote NA, and arbitrary value, mean/median/mode, and random imputation.
