Exploratory Data Analysis

 

Overview of Exploratory Data Analysis

Exploratory data analysis (EDA) is one of the first, and often one of the most important, steps in a data science or machine learning project. It helps data analysts and data scientists investigate a data set, summarize its main characteristics and draw insights from it. EDA also relies on data visualization to show those properties more clearly. In other words, EDA helps us understand a data set through summary statistics and graphical techniques such as histograms, bar charts, box plots, scatter plots and many more.

The main objective of EDA is to look at the data systematically before making any assumptions. It can help identify features and errors in the data set, detect outliers or anomalies, and build a better understanding of the data and the relationships between different variables. The exact steps may vary depending on the type of data.
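As a rough illustration of two of these goals, the sketch below flags outliers with the common interquartile range (IQR) rule and measures the linear relationship between two variables with a Pearson correlation. The data, the column names and the helper `iqr_outliers` are made up for this example, and the IQR rule is only one of several ways to look for outliers.

```python
import pandas as pd

# Hypothetical toy data; the column names and values are invented for illustration.
df = pd.DataFrame({
    "area_sqft": [650, 700, 720, 800, 850, 900, 5000],
    "price": [130, 142, 150, 161, 170, 182, 990],
})


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    mask = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return series[mask]


# Outlier detection on a single variable.
outliers = iqr_outliers(df["area_sqft"])
print("Possible outliers in area_sqft:")
print(outliers)

# Relationship between two variables (Pearson correlation by default).
corr = df["area_sqft"].corr(df["price"])
print("Correlation between area_sqft and price:", round(corr, 3))
```

A flagged value is only a candidate: whether it is an error or a genuine extreme observation still has to be judged from the problem context.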

Why do we need to perform EDA?

Any machine learning model relies heavily on the data it is given in order to produce accurate predictions or output. To get higher accuracy, we first need to understand the data we are using to solve the problem. That means exploring the data from different angles to gain insights and to summarize its characteristics, which is exactly what exploratory data analysis is for. EDA not only gives us a summary and insights but also tells us which tasks we need to perform while cleaning the data and during feature engineering, so that the data is in a form the model can accept and learn from. If a machine learning model is given raw, uncleaned data, it may fail to accept the data at all, or it may produce poor predictions with high error.

Generalized Exploratory Data Analysis

In a broad sense, EDA consists of basic data exploration, univariate analysis and bivariate analysis. There are no fixed methods or steps for performing EDA; it all depends on the data we have and the problem we are solving. Some of the most commonly used steps in EDA are:

  1. Identify the features (dependent and independent features).
  2. Check the number of features and the number of observations (shape of the dataset).
  3. Check the data type of each feature.
  4. Check for missing or null values in each feature. Sometimes we also replace the missing values during EDA itself instead of during feature engineering.
  5. Explore the categorical variables.
  6. Explore the numerical variables and their distributions.
  7. Check the cardinality of the variables.
  8. Check for or detect outliers in the dataset.
  9. Check the relationships between variables.
  10. Summarize the dataset.
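The structural steps in this list (steps 1 to 7 and step 10) could be sketched in pandas roughly as follows. The toy dataset, its column names and the choice of `purchased` as the dependent feature are assumptions made for illustration, and filling missing ages with the median is just one possible strategy.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; "purchased" plays the role of the dependent feature.
df = pd.DataFrame({
    "age": [25, 32, 47, np.nan, 51, 38],
    "city": ["Pune", "Delhi", "Pune", "Mumbai", None, "Delhi"],
    "income": [30000.0, 42000.0, 58000.0, 39000.0, 61000.0, 45000.0],
    "purchased": [0, 1, 1, 0, 1, 0],
})

# Step 1: separate the dependent feature from the independent features.
target = "purchased"
features = [col for col in df.columns if col != target]

# Step 2: number of observations and number of columns.
n_rows, n_cols = df.shape

# Step 3: data type of each column.
dtypes = df.dtypes

# Step 4: missing values per column, plus one possible way to fill them.
missing = df.isnull().sum()
age_filled = df["age"].fillna(df["age"].median())

# Steps 5 and 6: split the independent features by kind and explore each group.
numerical = df[features].select_dtypes(include="number").columns.tolist()
categorical = df[features].select_dtypes(exclude="number").columns.tolist()
city_counts = df["city"].value_counts(dropna=False)
numeric_summary = df[numerical].describe()

# Step 7: cardinality (number of distinct non-null values) of categorical features.
cardinality = df[categorical].nunique()

# Step 10: overall summary of every column.
summary = df.describe(include="all")

print("Shape:", (n_rows, n_cols))
print(dtypes)
print(missing)
print(city_counts)
print(numeric_summary)
print(cardinality)
print(summary)
```

On a real dataset the same calls apply unchanged; only the column names and the handling of missing values need to be adapted.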

The steps above may include extra steps depending on the data set. Performing EDA takes a lot of time: writing the code, exploring each variable and plotting them all add up.

Recently the data science community has seen the development of new libraries and packages that automate much of the manual work of exploratory data analysis. Some of the well-known libraries are:

  1. Pandas-Profiling
  2. Sweetviz
  3. Dtale

Conclusion: Exploratory Data Analysis is the first step performed after data gathering. It helps us analyze the data and summarize its main characteristics before making any assumptions. It also tells us which steps need to be performed in data preprocessing and feature engineering. EDA is an important part of any machine learning project because it provides insights that help avoid false assumptions and inferences. EDA is also a long and time-consuming process, so to get through the steps quickly we can use libraries that perform EDA in just a few lines of code and generate a report from it. These libraries include Pandas-Profiling, Sweetviz and Dtale.

 
