
Sunday, 26 November 2017

Machine Learning 101

Data science is trending these days. Everybody wants to become a data scientist because the field is cutting edge and the jobs pay well.

I am also starting to learn this cool stuff, and the best way to learn is to write some code and get a feel for it.
I am starting a series of blog posts on how to approach a machine learning problem without getting lost in jargon and complex mathematical formulas.

Some problems suited to ML are:
 - Email classification (spam vs ham)
 - Review classification for books, hotels, food, etc.
 - Sentiment analysis of a brand or product

In each of the above problems the output is some kind of label (spam, ham, positive, negative, etc.).
Problems of this type are classification problems. Let's look at a diagram to understand this.

[Diagram: Email -> Classifier -> Spam/Ham]


The email is the problem instance.
The classifier is the process that knows the rules for classifying it.
Spam/Ham is the output. If there are only two possible outputs, it is called a binary classifier.

Let's zoom in on the "Classifier": it is a rule engine that knows all the business rules needed to do the work.
Rules can be static or dynamic. In some cases static rules are fine, but for many problems the rules keep changing, and keeping them up to date becomes an overhead.
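
To make the "static rules" idea concrete, here is a minimal sketch of a hand-written rule engine for spam. The keyword list and the function name are hypothetical, made up only for illustration. The last call shows the maintenance problem: as soon as the wording of spam changes, the fixed rule misses it.

```python
# Hypothetical hand-written rules: these keywords are illustrative only.
SPAM_KEYWORDS = {"lottery", "winner", "free"}


def classify_email(text):
    """Return 'spam' if any hard-coded keyword appears in the text, else 'ham'."""
    words = set(text.lower().split())
    return "spam" if words & SPAM_KEYWORDS else "ham"


print(classify_email("You are a lottery winner"))
print(classify_email("Meeting moved to Monday"))
# A small change in spelling slips past the static keyword list.
print(classify_email("Claim your fr33 prize"))
```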

This is where ML comes into the picture: the system learns new rules by itself.

How does the system learn?
You have to train the system, and through that training it identifies the patterns. This type of ML is called supervised learning.

Every supervised learning model has a training phase before it starts making predictions.

Think of the training phase as practice: it contains both questions and answers.

Once the training phase is done, the classifier is ready to use on new questions that come without answers.

There is one more phase called "Test". It checks how good the classifier is on labelled examples it did not see during training. If the results are not good, that usually points to a problem with the training data.
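
The training and test phases can be sketched as a simple split of the labelled data, followed by an accuracy check on the held-out part. This is only an illustration: the helper names (split, accuracy), the test size of 2 and the placeholder classifier that always answers "Negative" are assumptions, not part of any library.

```python
import random

# Toy labelled reviews; a real dataset would be far larger.
reviews = [
    ("Wasted two hours", "Negative"),
    ("A bit predictable", "Negative"),
    ("The story itself is just predictable and lazy", "Negative"),
    ("Buy it, play it, enjoy it, love it", "Positive"),
    ("It deserves strong love", "Positive"),
]


def split(data, test_size=2, seed=42):
    """Shuffle a copy of the data and hold back test_size examples for testing."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - test_size
    return shuffled[:cut], shuffled[cut:]


def accuracy(classifier, test_set):
    """Fraction of test examples where the predicted label matches the answer."""
    correct = sum(1 for text, label in test_set if classifier(text) == label)
    return correct / len(test_set)


def always_negative(text):
    """Placeholder classifier, used only to exercise accuracy()."""
    return "Negative"


train_set, test_set = split(reviews)
print(len(train_set), "training examples,", len(test_set), "test examples")
print("accuracy:", accuracy(always_negative, test_set))
```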

Now let's try some sentiment analysis using IMDb reviews.


Review                                         | Type
Wasted two hours                               | Negative
A bit predictable                              | Negative
The story itself is just predictable and lazy  | Negative
Buy it, play it, enjoy it, love it             | Positive
It deserves strong love                        | Positive

In the learning phase the model takes the above data and works out the patterns for positive and negative comments.
You must be wondering how the model knows about these patterns.

One way to look at the problem is to compute the probability of each word appearing in a positive comment versus a negative comment, and then pick the label with the greater probability.
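
As a rough sketch, these word probabilities start from simple counts of how often each word shows up under each label. The tokenizer below is deliberately crude (lowercase, strip commas and periods) and is an assumption for illustration, as is the variable naming.

```python
from collections import Counter, defaultdict

# The five toy reviews used in this post.
reviews = [
    ("Wasted two hours", "Negative"),
    ("A bit predictable", "Negative"),
    ("The story itself is just predictable and lazy", "Negative"),
    ("Buy it, play it, enjoy it, love it", "Positive"),
    ("It deserves strong love", "Positive"),
]


def tokenize(text):
    """Lowercase the text and strip commas/periods from each word."""
    return [w.strip(",.") for w in text.lower().split()]


# word_counts[label][word] = number of times the word appears under that label.
word_counts = defaultdict(Counter)
for text, label in reviews:
    word_counts[label].update(tokenize(text))

for word in ("wasted", "predictable", "love"):
    print(word, "neg:", word_counts["Negative"][word], "pos:", word_counts["Positive"][word])
```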

Let's look at an example to understand this.

"Wasted" and "predictable" appear only in negative comments, so any comment that contains these words has a high chance of being negative.

"Love" appears only in positive comments.

So the algorithm is very simple:

 - Split the sentence into words
 - Pos score = product of the positive probabilities of the words
 - Neg score = product of the negative probabilities of the words
 - Result = the label with the larger score, i.e. max(Pos score, Neg score)


Before the scores are compared, each one has to be multiplied by the overall probability of a positive or negative comment.

In our toy example:

Total reviews               | 5
Positive comments           | 2
Negative comments           | 3
Positive probability (2/5)  | 0.4
Negative probability (3/5)  | 0.6
 
Pos score = Pos(words in sentence) * overall Pos (i.e. 0.4)
Neg score = Neg(words in sentence) * overall Neg (i.e. 0.6)
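
Putting the pieces together, here is a minimal sketch of the scoring on the toy data. Two things in it are my assumptions and not part of the description above: add-one (Laplace) smoothing, so that a word never seen under a label does not turn the whole product into zero, and summing logarithms instead of multiplying probabilities, which picks the same winner but avoids very small numbers. The function names (tokenize, train, predict) are made up for this example.

```python
import math
from collections import Counter, defaultdict

# The five toy reviews used in this post.
reviews = [
    ("Wasted two hours", "Negative"),
    ("A bit predictable", "Negative"),
    ("The story itself is just predictable and lazy", "Negative"),
    ("Buy it, play it, enjoy it, love it", "Positive"),
    ("It deserves strong love", "Positive"),
]


def tokenize(text):
    """Lowercase the text and strip commas/periods from each word."""
    return [w.strip(",.") for w in text.lower().split()]


def train(data):
    """Count labels and per-label words; return them with the vocabulary."""
    label_counts = Counter(label for _, label in data)
    word_counts = defaultdict(Counter)
    for text, label in data:
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return label_counts, word_counts, vocab


def predict(text, model):
    """Return the label with the highest (log) score for the given text."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    scores = {}
    for label, n in label_counts.items():
        # Start from the overall label probability (0.4 or 0.6 in the toy data).
        score = math.log(n / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in tokenize(text):
            # Add-one smoothing (an assumption): unseen words get a small
            # non-zero probability instead of zeroing the score.
            score += math.log((word_counts[label][word] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


model = train(reviews)
print(predict("Predictable and lazy plot", model))
print(predict("I love it", model))
```

With only five reviews the model is a toy; the same counting works unchanged on a larger labelled dataset.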

You don't have to code this algorithm yourself; it is already implemented as the Naive Bayes classifier.
Although this classifier has "Naive" in its name, it is not really naive. The name refers to its assumption that words occur independently of each other. It gives very good results, with accuracy of around 80%, and it is used in many production systems.

Most languages and frameworks, such as Python, Spark, and R, have an implementation of this classifier.

I have implemented this classifier in Python/Spark using the Sentiment Labelled Sentences data from the UCI ML repository.

This sample data contains reviews from IMDb, Yelp, and Amazon.

The code is available on GitHub.

