Machine Learning – Ways to identify outliers in your data
Cleaning the data is one of the most important and most time-consuming steps in any Machine Learning problem. You need a clear understanding of the data, and you have to process it well, to get accurate results.
There are so many things to be considered for processing the data. Dealing with outliers – finding outliers and cleaning them if required – is one of them.
Let’s see what outliers are, ways to identify them and how to remove them.
First things first: what are outliers?
“In statistics, an outlier is a data point that differs significantly from other observations. An outlier may be due to variability in the measurement or it may indicate experimental error; the latter are sometimes excluded from the data set. An outlier can cause serious problems in statistical analyses.”
———————————————————
Hopefully the Wikipedia definition above is clear enough. We have to review the data and identify the data points that are potential outliers. We then have to work with domain experts to confirm whether those points are real outliers and can be removed. Depending on the type of data and the problem being worked on, we may sometimes have to keep the outliers as well.
Now, let’s look at different ways to find outliers.
IQR method
We can use the IQR (interquartile range) rule to identify data points that appear to be outliers.
Any point more than 1.5 × IQR above the 75th percentile (3rd quartile), or more than 1.5 × IQR below the 25th percentile (1st quartile), is considered an outlier. I know that didn’t make it clear for you, but the diagram and the explanation below will make it easy to understand.
As you can see in the above diagram (box plot), we have to calculate the quartiles (25th, 50th and 75th percentiles) of the data. The IQR is the difference between the 3rd quartile (Q3) and the 1st quartile (Q1), which is the box in the diagram. Anything above the maximum point or below the minimum point is an outlier and may need to be removed. The maximum point is 1.5 × IQR above Q3 and the minimum point is 1.5 × IQR below Q1.
———————————————————
Below is a small Python snippet that does the same.
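import numpy as np

# Sample data points (renamed from input/min/max to avoid shadowing Python built-ins)
data = (2, 13, 14, 12, 18, 15, 16, 14, 25)

# 1st and 3rd quartiles (25th and 75th percentiles)
q1, q3 = np.percentile(data, 25), np.percentile(data, 75)
iqr = q3 - q1

# Points more than 1.5 * IQR beyond the quartiles are treated as outliers
lower_bound = q1 - (1.5 * iqr)
upper_bound = q3 + (1.5 * iqr)
outliers = [val for val in data if val < lower_bound or val > upper_bound]

print("Outliers for the datapoints {} are : \n {}".format(data, outliers))

Output: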
Outliers for the datapoints (2, 13, 14, 12, 18, 15, 16, 14, 25) are :
[2, 25]
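Once the outliers are identified, they can be separated from the rest of the data instead of just being listed. The sketch below is one possible way to do that, not the only one. It assumes the same 1.5 × IQR rule, and the function name split_outliers and its parameter k are hypothetical names chosen for this illustration. It uses statistics.quantiles from the Python standard library (Python 3.8+) with method="inclusive", which interpolates linearly between data points and gives the same quartiles as the NumPy call above for this data set.

```python
from statistics import quantiles


def split_outliers(values, k=1.5):
    """Split values into (kept, outliers) using the IQR rule.

    split_outliers and k are hypothetical names used only in this sketch.
    """
    # Quartiles with linear interpolation between data points
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_bound = q1 - k * iqr
    upper_bound = q3 + k * iqr

    kept, outliers = [], []
    for value in values:
        if lower_bound <= value <= upper_bound:
            kept.append(value)
        else:
            outliers.append(value)
    return kept, outliers


data = (2, 13, 14, 12, 18, 15, 16, 14, 25)
kept, outliers = split_outliers(data)
print("Kept:", kept)          # Kept: [13, 14, 12, 18, 15, 16, 14]
print("Outliers:", outliers)  # Outliers: [2, 25]
```

Returning both lists, rather than deleting the outliers right away, keeps the flagged points available for review with the domain experts before deciding whether they should really be removed.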
———————————————————