Machine Learning – Ways to identify outliers in your data

Cleaning the data is one of the most important and most time-consuming steps in any Machine Learning problem. You need a clear understanding of the data, and you have to process it well to get accurate results.

There are so many things to be considered for processing the data. Dealing with outliers – finding outliers and cleaning them if required – is one of them.

Let’s see what outliers are, ways to identify them and how to remove them.

First things first: what are outliers?

“In statistics, an outlier is a data point that differs significantly from other observations. An outlier may be due to variability in the measurement or it may indicate experimental error; the latter are sometimes excluded from the data set. An outlier can cause serious problems in statistical analyses.”

Hopefully the Wikipedia definition above is clear enough. We have to review the data and identify the data points that are potential outliers, then work with domain experts to confirm whether they are real outliers and can be removed. Depending on the type of data and the problem being solved, we may sometimes have to keep the outliers.

Now, let’s look at different ways to find outliers.

IQR method

We can use the IQR (interquartile range) rule to identify data points that appear to be outliers.

Any point more than 1.5 IQR above the 75th percentile (3rd quartile) or more than 1.5 IQR below the 25th percentile (1st quartile) is considered an outlier. I know that doesn’t make it entirely clear, but the diagram and the explanation below should make it easier to understand.

As you can see in the diagram above (a box plot), we first calculate the quartiles (25th, 50th and 75th percentiles) of the data. The IQR is the difference between the 3rd quartile and the 1st quartile (the box in the diagram). Anything above the maximum point or below the minimum point is an outlier and may need to be removed. The maximum point is 1.5 IQR above Q3 and the minimum point is 1.5 IQR below Q1.

Here is a small Python snippet that does the same.

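import numpy as np

# Sample data (named data, lower, upper to avoid shadowing the built-ins input, min, max)
data = (2, 13, 14, 12, 18, 15, 16, 14, 25)

# 1st and 3rd quartiles (25th and 75th percentiles)
q1, q3 = np.percentile(data, 25), np.percentile(data, 75)
iqr = q3 - q1

# Minimum and maximum points: 1.5 IQR below Q1 and 1.5 IQR above Q3
lower = q1 - (1.5 * iqr)
upper = q3 + (1.5 * iqr)

# Anything outside those two points is flagged as an outlier
outliers = [val for val in data if val < lower or val > upper]

print("Outliers for the datapoints {} are : \n {}".format(data, outliers))
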
The output should look like this:

Outliers for the datapoints (2, 13, 14, 12, 18, 15, 16, 14, 25) are :
 [2, 25]

Box plots give a good visualisation of outliers, as in the example image above.

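Since the goal is often to remove the outliers once they are confirmed, here is a minimal sketch of that step using the same IQR rule. It uses Python's built-in statistics module instead of NumPy, and the function name split_by_iqr and its parameter k are illustrative names chosen for this example, not part of any library. Whether the flagged points should really be dropped is still a decision to make with the domain experts.

```python
import statistics

def split_by_iqr(values, k=1.5):
    """Split values into (kept, outliers) using the IQR rule.

    split_by_iqr and k are illustrative names, not a library API.
    """
    # method="inclusive" interpolates like np.percentile's default (linear)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Keep the original order of the data in both lists
    kept = [v for v in values if lower <= v <= upper]
    outliers = [v for v in values if v < lower or v > upper]
    return kept, outliers

data = (2, 13, 14, 12, 18, 15, 16, 14, 25)
kept, outliers = split_by_iqr(data)
print("Kept    :", kept)
print("Removed :", outliers)
```

For the sample data this keeps the seven middle values and sets aside 2 and 25.
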
Z-Score Method

This method applies to data that follows (or nearly follows) a normal distribution, and is based on the 68-95-99.7 rule in statistics. According to the 68-95-99.7 (or empirical) rule, about 68% of the data points lie within 1 standard deviation of the mean of the distribution, 95% within 2 standard deviations and 99.7% within 3 standard deviations. That means almost all the data points lie within 3 standard deviations of the mean.
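
If you want to check those percentages yourself, the short sketch below computes them from the standard normal distribution using Python's built-in statistics.NormalDist. The helper name coverage is just an illustrative name for this example.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1
standard_normal = NormalDist(mu=0, sigma=1)

def coverage(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print("within {} standard deviation(s): {:.2%}".format(k, coverage(k)))
```

The three printed values come out at roughly 68.27%, 95.45% and 99.73%, which is where the name of the rule comes from.
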

The z-score of a data point x is calculated with the equation below.

z = (x - mean) / standard deviation

A cutoff of 3 standard deviations is the most common choice for the z-score method (following the empirical rule), but the cutoff is up to the user.

You may also see references to the standard deviation method, which is essentially the same as the z-score method: data points that fall outside the cutoff (usually 3 standard deviations from the mean) are considered outliers.

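A minimal sketch of the z-score method is shown below, again with only the built-in statistics module. The function name z_score_outliers and the parameter cutoff are illustrative names, and using the population standard deviation (pstdev) is an assumption of this sketch; some people prefer the sample standard deviation instead.

```python
import statistics

def z_score_outliers(values, cutoff=3.0):
    """Return the values whose absolute z-score is greater than cutoff.

    z_score_outliers and cutoff are illustrative names, not a library API.
    """
    mean = statistics.fmean(values)
    # Population standard deviation (an assumption; np.std uses the same default)
    std = statistics.pstdev(values)
    if std == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs((v - mean) / std) > cutoff]

data = (2, 13, 14, 12, 18, 15, 16, 14, 25)
print("cutoff 3:", z_score_outliers(data, cutoff=3.0))
print("cutoff 2:", z_score_outliers(data, cutoff=2.0))
```

On this small sample a cutoff of 3 flags nothing and a cutoff of 2 flags only the value 2, even though the IQR rule flagged both 2 and 25. The extreme values themselves inflate the mean and the standard deviation, so the z-score method tends to work better on larger, roughly normal datasets.
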
Hope this post helps you. Please share your feedback/suggestions in the comments section below.
