Data Exploration in Data Mining

Introduction to Data Exploration – Feature Engineering and Feature Selection in Data Mining

In this article, I will discuss:

  • How to read the dataset
  • How to find the data types of the columns
  • General data description
  • Univariate analysis and bi-variate analysis


The dataset used in this demonstration is the titanic.csv file.

First, we import the required libraries (pandas, numpy, seaborn, and matplotlib) along with the explore helper module from data_exploration.

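import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import os
# Matplotlib 3.6+ names this style 'seaborn-v0_8-colorblind'.
plt.style.use('seaborn-colorblind')
# Jupyter/IPython magic: render plots inside the notebook.
%matplotlib inline
# Helper module used throughout this article (not part of pandas or seaborn).
from data_exploration import explore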

Next, we use the read_csv() function from the pandas library to read the dataset. We are interested in only a few columns, so we list them in use_cols and pass that list to the usecols parameter.

use_cols = ['Pclass', 'Sex', 'Age', 'Fare', 'SibSp', 'Survived']

data = pd.read_csv('./data/titanic.csv', usecols=use_cols)
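As a minimal sketch of what usecols does, the snippet below reads the first five Titanic rows from an in-memory string instead of ./data/titanic.csv. The PassengerId column is included only to show a column being skipped, and the name sample is an assumption used for illustration.

```python
import io

import pandas as pd

# First five Titanic rows as CSV text; PassengerId is deliberately not requested.
csv_text = """PassengerId,Survived,Pclass,Sex,Age,SibSp,Fare
1,0,3,male,22.0,1,7.2500
2,1,1,female,38.0,1,71.2833
3,1,3,female,26.0,0,7.9250
4,1,1,female,35.0,1,53.1000
5,0,3,male,35.0,0,8.0500
"""

use_cols = ['Pclass', 'Sex', 'Age', 'Fare', 'SibSp', 'Survived']

# usecols keeps only the listed columns; the rest are never loaded.
sample = pd.read_csv(io.StringIO(csv_text), usecols=use_cols)

print(sample.columns.tolist())
print(sample.shape)
```

Note that pandas ignores the order of the names in usecols and keeps the column order of the file, which is why Survived comes first in the loaded data even though it is last in use_cols.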

Now we display the first five rows with data.head(5) to confirm that the dataset was read successfully.


   Survived  Pclass  Sex     Age   SibSp  Fare
0  0         3       male    22.0  1      7.2500
1  1         1       female  38.0  1      71.2833
2  1         3       female  26.0  0      7.9250
3  1         1       female  35.0  1      53.1000
4  0         3       male    35.0  0      8.0500
First 5 rows of the dataset

Univariate Analysis

The following methods give basic statistics on a single variable:

  • pandas.DataFrame.dtypes
  • pandas.DataFrame.describe()
  • Barplot
  • Countplot
  • Boxplot
  • Distplot

pandas.DataFrame.dtypes

Now we use the get_dtypes() function from explore to get three lists: the string columns, the numeric columns, and all columns.

str_var_list, num_var_list, all_var_list = explore.get_dtypes(data=data)
print(str_var_list) # string type
print(num_var_list) # numeric type
print(all_var_list) # all

Output:

['Sex']
['Survived', 'Pclass', 'Age', 'SibSp', 'Fare']
['Sex', 'Survived', 'Pclass', 'Age', 'SibSp', 'Fare']
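The source of explore.get_dtypes() is not shown here, so the following is only a sketch of how the same three lists could be built with plain pandas. It assumes that "string type" means every non-numeric column; the sample frame holds the first five Titanic rows.

```python
import pandas as pd

# First five rows of the Titanic columns used in this article.
sample = pd.DataFrame({
    'Survived': [0, 1, 1, 1, 0],
    'Pclass': [3, 1, 3, 1, 3],
    'Sex': ['male', 'female', 'female', 'female', 'male'],
    'Age': [22.0, 38.0, 26.0, 35.0, 35.0],
    'SibSp': [1, 1, 0, 1, 0],
    'Fare': [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# dtypes lists the storage type of every column.
print(sample.dtypes)

# Split the column names into numeric and non-numeric groups.
num_var_list = sample.select_dtypes(include='number').columns.tolist()
str_var_list = sample.select_dtypes(exclude='number').columns.tolist()
all_var_list = str_var_list + num_var_list

print(str_var_list)
print(num_var_list)
print(all_var_list)
```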

pandas.DataFrame.describe()

Next, we use the describe() function to get a general description of the dataset. It reports statistics such as the count, the number of unique values, the most frequent value (top) and its frequency (freq), the mean, the standard deviation, the minimum, the maximum, and the 25th, 50th, and 75th percentiles.

explore.describe(data=data,output_path=r'./output/')

Output of the describe() function:

        Survived    Pclass      Sex   Age         SibSp       Fare
count   891.000000  891.000000  891   714.000000  891.000000  891.000000
unique  NaN         NaN         2     NaN         NaN         NaN
top     NaN         NaN         male  NaN         NaN         NaN
freq    NaN         NaN         577   NaN         NaN         NaN
mean    0.383838    2.308642    NaN   29.699118   0.523008    32.204208
std     0.486592    0.836071    NaN   14.526497   1.102743    49.693429
min     0.000000    1.000000    NaN   0.420000    0.000000    0.000000
25%     0.000000    2.000000    NaN   20.125000   0.000000    7.910400
50%     0.000000    3.000000    NaN   28.000000   0.000000    14.454200
75%     1.000000    3.000000    NaN   38.000000   1.000000    31.000000
max     1.000000    3.000000    NaN   80.000000   8.000000    512.329200
Data Description
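A table with this layout can be produced by pandas alone. The sketch below assumes explore.describe() builds on DataFrame.describe(include='all'), which is not confirmed by the source; it runs on the first five Titanic rows, so the numbers differ from the full-dataset table.

```python
import pandas as pd

# First five rows of the Titanic columns used in this article.
sample = pd.DataFrame({
    'Survived': [0, 1, 1, 1, 0],
    'Pclass': [3, 1, 3, 1, 3],
    'Sex': ['male', 'female', 'female', 'female', 'male'],
    'Age': [22.0, 38.0, 26.0, 35.0, 35.0],
    'SibSp': [1, 1, 0, 1, 0],
    'Fare': [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# include='all' adds unique/top/freq for text columns next to the numeric
# statistics; cells that do not apply to a column are filled with NaN.
summary = sample.describe(include='all')
print(summary)
```

In the full-dataset table, count is also a quick check for missing values: Age has a count of 714 against 891 rows, so 177 ages are missing.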

Discrete variable barplot

The discrete_var_barplot() function draws a barplot of a discrete variable x against y (the target variable). By default, each bar shows the mean value of y.

explore.discrete_var_barplot(x='Pclass',y='Survived',data=data,output_path='./output/')
Discrete variable barplot
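Since each bar shows the mean of y, the bar heights can be reproduced with a groupby. This is a sketch on the first five Titanic rows, not the code inside explore; for a 0/1 target such as Survived, the mean is the survival rate of the class.

```python
import pandas as pd

# Pclass and Survived for the first five Titanic rows.
sample = pd.DataFrame({
    'Survived': [0, 1, 1, 1, 0],
    'Pclass': [3, 1, 3, 1, 3],
})

# One value per class: the mean of the target, i.e. the height of each bar.
bar_heights = sample.groupby('Pclass')['Survived'].mean()
print(bar_heights)
```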

Discrete variable countplot

The discrete_var_countplot() function draws a countplot of a discrete variable x, that is, the number of rows for each value of x.

explore.discrete_var_countplot(x='Pclass',data=data,output_path='./output/')
Discrete variable countplot
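The numbers behind a countplot are plain frequency counts. A minimal sketch with pandas value_counts(), again on the first five Titanic rows:

```python
import pandas as pd

# Pclass for the first five Titanic rows.
pclass = pd.Series([3, 1, 3, 1, 3], name='Pclass')

# Number of rows per class, ordered by class label rather than by frequency.
counts = pclass.value_counts().sort_index()
print(counts)
```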

Discrete variable boxplot

The discrete_var_boxplot() function draws a boxplot of y for each value of a discrete variable x.

explore.discrete_var_boxplot(x='Pclass',y='Fare',data=data,output_path='./output/')
Discrete variable boxplot
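A boxplot summarizes each group by its quartiles: the box spans the 25th to 75th percentile and the line inside marks the median. The sketch below computes those three values of Fare per Pclass for the first five Titanic rows; it illustrates the statistics only and does not draw the plot.

```python
import pandas as pd

# Pclass and Fare for the first five Titanic rows.
sample = pd.DataFrame({
    'Pclass': [3, 1, 3, 1, 3],
    'Fare': [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# Rows are classes, columns are the quartiles that define each box.
quartiles = (
    sample.groupby('Pclass')['Fare']
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(quartiles)
```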

Bi-variate Analysis

Bi-variate analysis examines the relationship between two variables. Common tools include:

  • Scatter Plot
  • Correlation Plot
  • Heat Map

Continuous variable distplot

The continuous_var_distplot() function is used to draw the distplot of a continuous variable x.

explore.continuous_var_distplot(x=data['Fare'],output_path='./output/')
Continuous variable distplot
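A distplot is essentially a histogram of the variable, optionally with a density curve on top. The sketch below computes only the histogram part with NumPy on the first five Fare values; the bin edges are an arbitrary choice for illustration. In seaborn itself, distplot has been deprecated since version 0.11 in favor of histplot and displot.

```python
import numpy as np

# Fare for the first five Titanic rows.
fare = np.array([7.25, 71.2833, 7.925, 53.1, 8.05])

# Count how many fares fall into each bin; edges chosen for illustration only.
bin_edges = [0, 10, 50, 100]
counts, edges = np.histogram(fare, bins=bin_edges)

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:>5.0f} to {right:<5.0f} {n}")
```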

Correlation plot

The correlation_plot() function is used to draw the correlation plot between variables.

explore.correlation_plot(data=data,output_path='./output/')
Correlation plot
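The matrix behind a correlation plot can be computed with pandas. This sketch assumes Pearson correlation over the numeric columns, which is the pandas default; which coefficient explore.correlation_plot() uses is not stated in the source. With only five rows the values are illustrative, not meaningful estimates.

```python
import pandas as pd

# First five rows of the Titanic columns used in this article.
sample = pd.DataFrame({
    'Survived': [0, 1, 1, 1, 0],
    'Pclass': [3, 1, 3, 1, 3],
    'Sex': ['male', 'female', 'female', 'female', 'male'],
    'Age': [22.0, 38.0, 26.0, 35.0, 35.0],
    'SibSp': [1, 1, 0, 1, 0],
    'Fare': [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# Correlation is defined for numeric data only, so drop the text column first.
corr = sample.select_dtypes(include='number').corr()
print(corr.round(2))
```

Each cell holds the correlation between the row variable and the column variable, so the matrix is symmetric with ones on the diagonal. A heat map of this matrix is what the correlation plot displays.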

Summary

This article introduced Data Exploration – Feature Engineering and Feature Selection in Data Mining: reading the dataset, checking the column data types, describing the data, and performing univariate and bi-variate analysis.
