Machine Learning – Ways to identify outliers in your data

Cleaning the data is one of the most important and most time-consuming steps in any Machine Learning problem. You need a clear understanding of the data, and you have to process it well to get accurate results.

There are so many things to be considered for processing the data. Dealing with outliers – finding outliers and cleaning them if required – is one of them.

Let’s see what outliers are, ways to identify them and how to remove them.

First things first: what are outliers?

“In statistics, an outlier is a data point that differs significantly from other observations. An outlier may be due to variability in the measurement or it may indicate experimental error; the latter are sometimes excluded from the data set. An outlier can cause serious problems in statistical analyses.”

Hope the above Wikipedia definition is clear enough for everyone. We have to review the data and identify the data points which are potential outliers. We then have to work with the domain experts to confirm whether those are real outliers and can be removed. Depending on the type of data and the problem being worked on, sometimes we may have to keep the outliers as well.

Now, let’s look at different ways to find outliers.

IQR method

We can use the IQR (interquartile range) rule to identify data points which appear to be outliers.

Any point more than 1.5 IQR above the 75th percentile (3rd Quartile) or more than 1.5 IQR below the 25th percentile (1st Quartile) is considered an outlier. I know that didn’t make it clear for you, but the diagram and the explanation below will make it easy to understand.

As you can see in the above diagram (box plot), we have to calculate the quartiles (25th, 50th and 75th percentiles) of the data. The IQR is the difference between the 3rd Quartile and the 1st Quartile (the box in the diagram). Anything above the maximum point or below the minimum point is an outlier and may need to be removed. The maximum point is 1.5 IQR above Q3 and the minimum point is 1.5 IQR below Q1.

Here is a small Python snippet for the same.

import numpy as np

data = (2, 13, 14, 12, 18, 15, 16, 14, 25)
# First and third quartiles (25th and 75th percentiles)
q1, q3 = np.percentile(data, 25), np.percentile(data, 75)
iqr = q3 - q1
# Cutoffs: 1.5 * IQR below Q1 and 1.5 * IQR above Q3
lower = q1 - (1.5 * iqr)
upper = q3 + (1.5 * iqr)
outliers = [val for val in data if val < lower or val > upper]

print("Outliers for the datapoints {} are : \n {}".format(data, outliers))
The output should look like this:
Outliers for the datapoints (2, 13, 14, 12, 18, 15, 16, 14, 25) are :
 [2, 25]
Box plots give a good visualisation of the outliers, as in the example image above.
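
A box plot for the same data points can be drawn with matplotlib. This is only a rough sketch: matplotlib is an assumption (the post itself uses only NumPy), and the "Agg" backend line is there just so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a window
import matplotlib.pyplot as plt

data = (2, 13, 14, 12, 18, 15, 16, 14, 25)

fig, ax = plt.subplots()
# whis=1.5 ends each whisker at the most extreme point within 1.5 * IQR of the box
result = ax.boxplot(data, whis=1.5)
ax.set_title("Box plot of the sample data")

# Points drawn beyond the whiskers ("fliers") are the outliers
fliers = sorted(float(v) for v in result["fliers"][0].get_ydata())
print(fliers)
```

The fliers should be the same two points that the IQR code flags. In an interactive session, call plt.show() to see the plot.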

Z-Score Method
This method applies to data that follows (or almost follows) a normal distribution and is based on the 68-95-99.7 rule in statistics. According to the 68-95-99.7 (or empirical) rule, about 68% of the data points lie within 1 standard deviation of the mean of the distribution, about 95% within 2 standard deviations and about 99.7% within 3 standard deviations. That means almost all the data points lie within 3 standard deviations of the mean.
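
The rule is easy to check on simulated data. The sketch below draws samples from a normal distribution and counts how many fall within 1, 2 and 3 standard deviations of the mean; the sample size and the seed are arbitrary choices, not something from the rule itself.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # fixed seed so the run is repeatable
samples = rng.normal(loc=0.0, scale=1.0, size=100_000)

mean, std = samples.mean(), samples.std()
fractions = {}
for k in (1, 2, 3):
    # Boolean mask of the samples within k standard deviations of the mean
    within = np.abs(samples - mean) <= k * std
    fractions[k] = within.mean()
    print("within {} standard deviation(s): {:.1%}".format(k, fractions[k]))
```

The printed fractions should come out close to 68%, 95% and 99.7%, with small differences because the data is random.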

The z-score is calculated with the equation below.
$z = \dfrac{x - \mu}{\sigma}$, where $x$ is the data point, $\mu$ is the mean and $\sigma$ is the standard deviation of the data.
A cutoff of 3 standard deviations is chosen most often in z-score calculations (as per the empirical rule), but it is a user choice.
You may also see references to the standard deviation method, which is essentially the same as the z-score method. There, the data points outside the cutoff (usually 3 standard deviations from the mean) are treated as outliers.
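
A minimal sketch of the z-score method with NumPy follows. The data points are made up for illustration, and the cutoff of 3 is the usual choice mentioned above.

```python
import numpy as np

# Hypothetical sample: 19 ordinary readings and one extreme value
data = np.array([10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
                 12, 10, 11, 13, 12, 11, 10, 12, 11, 100])

threshold = 3   # cutoff in standard deviations (a user choice)
# z-score of every point: distance from the mean in units of standard deviation
z_scores = (data - data.mean()) / data.std()
outliers = data[np.abs(z_scores) > threshold].tolist()

print("Outliers:", outliers)
```

One thing to keep in mind: on very small samples a cutoff of 3 may never trigger. With n points and the population standard deviation (the NumPy default), no z-score can exceed the square root of n - 1, so a nine-point sample like the one in the IQR example tops out at about 2.83.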
Hope this post helps you. Please share your feedback/suggestions in the comments section below.

Machine Learning / Artificial Intelligence – basic pre-requisites to get started

Machine Learning and Artificial Intelligence technologies are booming (or should I say boooooming..!!) nowadays. Many of you are planning to get started with your AI or ML studies but are not clear on where to start or what the pre-requisites are. Yes, it is a different area from your current Infrastructure/Software/Database/Coding domain, and it requires a bit (not really, a lot more) of additional learning.
Let’s try to take a look at the most common pre-requisites for getting started with your AI/ML journey.
 


Mathematics
Maths, but how much maths?
Not too much; it is basically a refresher of your high-school maths. You just have to remember linear algebra (linear equations, matrices and vectors) and the basics of calculus (rate of change, integration, differentiation etc.). Remember, you need it to understand how the models work. You do not have to do all these calculations on your own or code them in mathematical terms. The model will do it; the maths is for you to understand what is happening in the back end.

Statistics and Probability 

It’s a lot to learn if you are not already good at it. You need a good level of statistics knowledge to understand AI/ML, as the basics are all built on statistical terms.
You should be well versed in descriptive statistics, including the mean, median, standard deviation, variance etc. of the data. The same applies to probability theory. You have to be good at it and able to understand data distributions, including the distribution functions PDF (Probability Density Function), CDF (Cumulative Distribution Function), PMF (Probability Mass Function) etc.
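
To make these terms concrete, here is a small sketch that computes the descriptive statistics for a made-up sample with NumPy, and evaluates the PDF and CDF of a standard normal distribution with NormalDist from Python's standard library. The numbers are only for illustration.

```python
import numpy as np
from statistics import NormalDist

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])   # hypothetical sample

mean = data.mean()
median = np.median(data)
variance = data.var()    # population variance (ddof=0 is the NumPy default)
std_dev = data.std()     # population standard deviation

print(mean, median, variance, std_dev)

# PDF and CDF of a standard normal distribution (mean 0, standard deviation 1)
standard_normal = NormalDist(mu=0, sigma=1)
pdf_at_zero = standard_normal.pdf(0)   # density at the mean
cdf_at_zero = standard_normal.cdf(0)   # probability of a value <= 0
print(pdf_at_zero, cdf_at_zero)
```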

You should also know histograms and other common plotting (visualisation) methods, along with the measures that give a better understanding of the data (covariance, correlation etc.).
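
As a quick illustration, covariance, correlation and the bin counts behind a histogram can all be computed with NumPy. The two arrays below are made up so that y is exactly twice x.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])   # y is exactly 2 * x

covariance = np.cov(x, y)[0, 1]         # sample covariance (divides by n - 1)
correlation = np.corrcoef(x, y)[0, 1]   # Pearson correlation, between -1 and 1

# Bin counts behind a histogram of x, using 2 equal-width bins
counts, bin_edges = np.histogram(x, bins=2)

print(covariance, correlation, counts, bin_edges)
```

Because y moves in perfect step with x, the correlation comes out as 1; real data will usually land somewhere in between -1 and 1.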

The more statistics you know, the better you can understand.
 
Programming
Is it necessary to know coding?
I would say, yes. There are MLaaS (Machine Learning as a Service) tools which will do most of the work for you without you writing a single line of code. But I would prefer coding it yourself for a better understanding, and that may be necessary depending on your job role.
Python is where everyone lands, based on its popularity and it being Machine Learning friendly (just kidding, it has a rich set of modules available for ML). If you are good at R, C/C++, Java, Julia, Scala, Ruby etc., you are still in a better position: 1 item (programming) is already checked.

 
Useful links
Let me add them soon…
 
I believe that covers the basic requirements to get started with your ML/AI learning. The more you work with data, the better you will be with it. It requires a lot of learning and even more practice. Please feel free to add your suggestions/queries in the comments section.

Understanding Supervised and Unsupervised (Machine) Learning

As you start with your Machine Learning, you will get to hear a lot about the terms Supervised and Unsupervised learning. You will find a lot of blogs, videos etc… explaining and differentiating these 2 types. This is just another attempt to explain with some examples.

Supervised learning

Supervised learning means you (a supervisor) are training the machine to identify a few patterns from the data you provide. Here the data has clear indications (labels) about the pattern for the machine to learn from. The machine can use this learning to find similar patterns in a new dataset.

As an example, you give a few Apples and Oranges to a kid and identify some of them as Apples and others as Oranges based on their characteristics (colour, hardness etc.). Next time you give him an Orange, he will be able to identify it as an Orange from his previous experience.
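
The fruit example can be sketched in a few lines of Python. This is only an illustration under assumptions: the two features (redness and hardness on a 0 to 1 scale) and all the values are made up, and the nearest-centroid rule used here is just one very simple way of learning from labels.

```python
import numpy as np

# Labelled training data: each row is (redness, hardness) on a 0-1 scale
features = np.array([[0.90, 0.80], [0.80, 0.70], [0.85, 0.90],   # apples
                     [0.30, 0.40], [0.20, 0.50], [0.25, 0.30]])  # oranges
labels = np.array(["apple", "apple", "apple", "orange", "orange", "orange"])

# "Training": remember the average feature vector (centroid) of each label
centroids = {name: features[labels == name].mean(axis=0)
             for name in ("apple", "orange")}

def predict(sample):
    """Return the label whose centroid is closest to the sample."""
    return min(centroids, key=lambda name: np.linalg.norm(sample - centroids[name]))

# A new, unlabelled fruit that is not very red and fairly soft
print(predict(np.array([0.28, 0.45])))
```

The new fruit sits much closer to the average orange than to the average apple, so it is identified as an orange.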

Classification and Regression are 2 types of Machine Learning models falling under the Supervised Learning category. We will learn these in detail in later posts, but just to give you an overview:

A classification model is used for classifying the input data (classifying emails as spam or not, medical diagnosis such as cancer or not etc.).

A regression model is used to identify a continuous relation between the output values and the given input values (predicting the rent for houses from their listed features).
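
A regression of the rent kind can be sketched with NumPy's polyfit. The listings below are made up, and a single feature (floor area) is used to keep it short; real rent prediction would use several features.

```python
import numpy as np

# Hypothetical listings: floor area in square metres and monthly rent
area = np.array([50, 60, 80, 100])
rent = np.array([700, 800, 1000, 1200])   # made up so that rent = 10 * area + 200

# Least-squares straight line: rent is approximately slope * area + intercept
slope, intercept = np.polyfit(area, rent, deg=1)

# Predict the rent for a house that was not in the training data
predicted_rent = slope * 70 + intercept
print(round(predicted_rent, 2))
```

Unlike classification, the output here is a continuous number rather than one of a fixed set of labels.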

Unsupervised learning

In this type of learning, the data provided is not labelled or classified. The machine tries to find patterns or similarities in the supplied data and to group or sort it.

Consider an example similar to the one we discussed for supervised learning: you give a basket of Apples and Oranges to your kid. You do not give any special instructions about the fruits in the basket, but your kid will still be able to form 2 groups, 1 of Apples and the other of Oranges. He/she is able to do it by observing characteristics like colour, softness etc.
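
The basket example can be sketched with a tiny k-means style loop. Again this is a sketch under assumptions: the features (redness and hardness on a 0 to 1 scale) and the values are made up, there are no labels at all, and the starting centres are simply the first and last fruit in the basket.

```python
import numpy as np

# Unlabelled fruit: each row is (redness, hardness) on a 0-1 scale
fruit = np.array([[0.90, 0.80], [0.80, 0.70], [0.85, 0.90],
                  [0.30, 0.40], [0.20, 0.50], [0.25, 0.30]])

# Start from two of the data points (first and last) as initial group centres
centres = fruit[[0, -1]].copy()

for _ in range(10):
    # Assign every fruit to its nearest centre
    distances = np.linalg.norm(fruit[:, None, :] - centres[None, :, :], axis=2)
    groups = distances.argmin(axis=1)
    # Move each centre to the mean of the fruit assigned to it
    centres = np.array([fruit[groups == k].mean(axis=0) for k in range(2)])

print(groups.tolist())
```

The loop ends with two groups of three fruit each. Note that the machine only knows them as group 0 and group 1; nobody told it which one is Apples and which one is Oranges.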

Clustering and Association are 2 common types of models falling under this type of learning.

Clustering is forming groups from the given input based on similarities (example below).

Association is used mostly for identifying a customer’s buying pattern, which is sometimes called Market basket analysis. For example, how many of the customers buying milk are also buying bread?
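
The milk and bread question can be answered with two simple measures, support and confidence. The baskets below are made up for illustration.

```python
# Hypothetical shopping baskets, one set of items per customer
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"milk"},
    {"bread"},
    {"milk", "bread", "butter"},
]

milk_baskets = [b for b in baskets if "milk" in b]
milk_and_bread = [b for b in milk_baskets if "bread" in b]

# Support: share of all baskets that contain both items
support = len(milk_and_bread) / len(baskets)
# Confidence of "milk -> bread": share of milk baskets that also contain bread
confidence = len(milk_and_bread) / len(milk_baskets)

print("support = {:.2f}, confidence = {:.2f}".format(support, confidence))
```

Here 3 of the 4 customers who bought milk also bought bread, so the confidence of the rule "milk, then bread" is 0.75.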

Hope you enjoyed reading this post. Please feel free to share your thoughts in the comments section.