What is it?
Exploratory data analysis (EDA) is the process of reviewing your data to get a sense of what you are dealing with. Statistics derived from our data can give us a good idea of what is going on, but data visualization often gives a clearer picture.
In this post I’m only going to touch on two types of visualizations: histograms and scatter plots. There are plenty of other visualizations out there, but these two are normally the first ones I reach for.
What is it used for?
It is often the first step in solving any data-related problem. You need to know what you are dealing with before you can propose a plan of attack, regardless of the application (prediction, correlation, ANOVA, etc.).
It is used to answer the following (amongst other things):
- What is the range of my data?
- Where is my data centered?
- How spread out is my data?
- What distribution is my data from?
Obviously we can calculate statistics for each of the above questions, but visualizations help us get the whole picture at once.
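As a quick illustration of the statistics side, here is a minimal sketch using Python's standard `statistics` module. The exam scores are made up for this example, and the variable names are my own.

```python
import statistics

# Hypothetical exam scores, invented for illustration only.
scores = [52, 61, 67, 70, 70, 74, 78, 81, 85, 92]

# What is the range of my data?
data_range = (min(scores), max(scores))

# Where is my data centered?
center_mean = statistics.mean(scores)
center_median = statistics.median(scores)

# How spread out is my data? (sample standard deviation)
spread = statistics.stdev(scores)

print(f"range:  {data_range}")
print(f"mean:   {center_mean}")
print(f"median: {center_median}")
print(f"stdev:  {spread:.2f}")
```

Each number answers one question at a time. None of them says anything about the shape of the distribution, which is where the plots come in.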
- Histogram: A bar chart whose x-axis is divided into equal-sized bins (intervals) that together cover the range of values of a single variable. The y-axis is the frequency, or count, of samples that fall in each bin.
- Scatter Plot: A plot normally used to plot a dependent variable (y-axis) against an independent variable (x-axis). For example, score on an exam (dependent) versus hours spent studying (independent). In general, it can be used to compare any two variables so long as they are matched, for example the weight (x-axis) and height (y-axis) of different individuals.
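To make the histogram definition concrete, the counting behind it can be sketched by hand in plain Python. The function name `histogram_counts`, the heights, and the bin width of 10 are all assumptions made for this example. Plotting libraries do this step for you.

```python
def histogram_counts(values, bin_width, start):
    """Count how many values fall in each equal-width bin [left, left + bin_width).

    Bins that contain no values are simply left out of the result.
    """
    counts = {}
    for v in values:
        index = int((v - start) // bin_width)  # which bin the value lands in
        left = start + index * bin_width       # left edge of that bin
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))


# Hypothetical heights in centimeters, invented for illustration only.
heights_cm = [152, 158, 161, 163, 165, 167, 168, 171, 174, 183]
counts = histogram_counts(heights_cm, bin_width=10, start=150)

# Print a rough text version of the histogram.
for left, n in counts.items():
    print(f"{left}-{left + 10}: {'#' * n}")
```

The bin edges sit on the x-axis and the counts become the bar heights on the y-axis.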
When Should I Use This?
Histograms are great for getting a high-level view of a single variable. We can use mean, median, and standard deviation to get a sense of the data, but a histogram fills in some of the gaps. It makes it easy to spot issues that descriptive statistics alone cannot reveal. For example, I may have a bimodal distribution or a heavily skewed distribution.
Example of a Bimodal Distribution
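Here is a small sketch of why a bimodal sample trips up descriptive statistics. The numbers are invented, with one cluster near 2 and another near 8. The mean and median both land in the gap between the clusters, where there is no data at all.

```python
import statistics

# Hypothetical bimodal sample: one cluster near 2, another near 8.
sample = [1.6, 1.9, 2.0, 2.1, 2.4, 7.6, 7.9, 8.0, 8.1, 8.4]

mean = statistics.mean(sample)
median = statistics.median(sample)

# Which observations lie within 1 unit of the mean?
nearby = [v for v in sample if abs(v - mean) <= 1]

print(f"mean:   {mean:.2f}")
print(f"median: {median:.2f}")
print(f"values within 1 of the mean: {nearby}")
```

Neither summary hints that there are two groups. A histogram of the same sample would show the two peaks right away.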
Scatter plots should be used when you want to identify a trend or relationship between two variables which are paired (i.e. they were measured for the same subject or sample).
In short:
- Scatter plot: to identify the relationship between two variables.
- Histogram: to investigate the distribution of a single variable.
Assumptions, Prerequisites, and Pitfalls
Be mindful of bin size. If your bins are too wide, you will not be able to see the shape of your sample. See the examples below for the same data:
Bin size of 0.25
Bin size of 4
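The effect of bin size can also be checked without a plot. The sketch below counts the same made-up sample with the two bin widths above, 0.25 and 4. The helper `bin_counts` and the data are assumptions for this example, not the data behind the figures.

```python
import math
from collections import Counter


def bin_counts(values, bin_width):
    """Map each bin's left edge to the number of values in [edge, edge + bin_width)."""
    counts = Counter(math.floor(v / bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


# Hypothetical sample with two clusters, one near 1 and one near 3.
sample = [0.6, 0.9, 1.0, 1.1, 1.4, 2.6, 2.9, 3.0, 3.1, 3.4]

narrow = bin_counts(sample, 0.25)  # several small bins, with a gap in the middle
wide = bin_counts(sample, 4)       # everything collapses into a single bin

print("bin size 0.25:", narrow)
print("bin size 4:   ", wide)
```

With a bin size of 4, all ten values share one bar and the two clusters disappear. With 0.25, the empty bins between 1.5 and 2.5 make the gap visible.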
Make sure you are plotting paired data. In other words, if you are plotting subject A’s height, don’t plot it against subject B’s weight. Always plot subject A’s weight against subject A’s height, subject B’s against subject B’s, and so on.
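One way to keep pairs intact is to store both measurements under the same subject and build the x and y lists from that single source. This is a minimal sketch. The subjects, field names, and values are all hypothetical.

```python
# Hypothetical measurements keyed by subject, so each height stays with its own weight.
subjects = {
    "A": {"height_cm": 160, "weight_kg": 55},
    "B": {"height_cm": 172, "weight_kg": 68},
    "C": {"height_cm": 185, "weight_kg": 90},
}

# Build both lists from the same ordered list of subject ids.
ids = sorted(subjects)
weights = [subjects[s]["weight_kg"] for s in ids]  # x-axis
heights = [subjects[s]["height_cm"] for s in ids]  # y-axis

# Position i in both lists now refers to the same subject.
pairs = list(zip(ids, weights, heights))
for subject, weight, height in pairs:
    print(f"subject {subject}: weight={weight} kg, height={height} cm")
```

If the two lists come from separate files or queries instead, it is worth matching them on a subject id before plotting rather than trusting that the row order agrees.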
Example - No Code
Example - Code
Python & Matplotlib
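A minimal sketch of both plots with Matplotlib, assuming it is installed. The data is randomly generated to mimic the exam score versus hours studied example. The relationship, the noise level, and the choice of 10 bins are my own assumptions, not real results.

```python
import random

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to use plt.show()
import matplotlib.pyplot as plt

# Synthetic data: hours studied and a noisy, roughly linear exam score.
random.seed(0)
hours = [random.uniform(0, 10) for _ in range(100)]
scores = [50 + 4 * h + random.gauss(0, 5) for h in hours]

fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: distribution of a single variable.
counts, edges, _ = ax_hist.hist(scores, bins=10)
ax_hist.set_xlabel("Exam score")
ax_hist.set_ylabel("Count")
ax_hist.set_title("Histogram of exam scores")

# Scatter plot: relationship between two paired variables.
ax_scatter.scatter(hours, scores)
ax_scatter.set_xlabel("Hours spent studying")
ax_scatter.set_ylabel("Exam score")
ax_scatter.set_title("Score vs. hours studied")

fig.tight_layout()
# To write the figure to disk: fig.savefig("eda_example.png")
```

Changing `bins=10` to a much smaller or larger number is an easy way to see the bin size pitfall on your own data.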
R & ggplot2