
Natural Language Processing (NLP) Tutorial | NLP Training | Intellipaat


Hey guys, welcome to this session by Intellipaat. Now, you have probably used virtual assistants such as Siri, Cortana, and Alexa, and these virtual assistants really do make tasks very easy, don't they? So, let's say you ask Siri what the distance between the Earth and the Sun is; it will immediately reply that the distance is 149.6 million kilometers. Similarly, if you ask Siri how to say "I'm hungry" in Hindi, it will reply, "Mai bhukha hu." Now, how is Siri or any other virtual assistant able to do this? Well, all of this is possible because of Natural Language Processing. So, keeping in mind how important NLP is, we have come up with this tutorial on Natural Language Processing.

So, let's have a quick glance at the agenda. We'll start by understanding what exactly NLP is, and then we'll learn how to tokenize words with the NLTK package. After that, we'll learn some concepts of NLP such as stemming, lemmatization, POS tagging, and Named Entity Recognition. Going ahead, we'll also implement some of these NLP concepts with the spaCy package, and finally there will be a quiz to recollect whatever we have learned in today's session. So, let me start off by asking a very simple question: what do you understand by the term Natural Language? Well, simply put, a Natural Language is any language which has evolved naturally in humans.

So, any of these languages, be it English, Hindi, French, Telugu, or any of the numerous languages which have evolved naturally in humans through use and repetition without conscious planning, can be termed a Natural Language. Now, since we are humans, we are able to comprehend these languages. But what about a machine? What if these same languages are fed to a machine? Will it be able to comprehend them? This is where Natural Language Processing comes in. Natural Language Processing is the ability of a computer program to understand and interpret human language as it is spoken. In other words, NLP is used to gain knowledge from the raw textual data at our disposal. Now, let's go ahead and look at the components of NLP. NLP can basically be divided into Natural Language Understanding and Natural Language Generation. Let's start with Natural Language Understanding. NLU, as the name states, deals with understanding the input given in the form of sentences in text or speech format.

And this is where the machine analyzes the different aspects of language. Then we have Natural Language Generation. NLG deals with turning raw data into simple, human-understandable language, or in other words, producing natural language from raw input. Now, to implement the NLP concepts, you would need some packages tailor-made for natural language processing. So, in today's session, we will work with two such packages, which are NLTK and spaCy. Generally, the first step in an NLP process is tokenization. In tokenization, we basically split text apart into individual units, and each individual unit should have a value associated with it. Let's take this example: we have the sentence "Joey doesn't share pizza", and we are splitting it up into individual tokens, where each word is taken as a separate token and the exclamation marks are also taken as separate tokens.

Now, we can use these tokens for another process like parsing or text mining. So, let's actually head on to our Jupyter Notebook and learn how to tokenize a sentence using the NLTK package. We'll start off by importing nltk and nltk.corpus. I'll run this. After that, we would require the word_tokenize function from the nltk.tokenize package, so we'll also import this function from nltk.tokenize and I'll click on run. So, over here we have a string: "After surprising the hosts in the first test, Sri Lanka made a positive start to the second test."

So, this is the entire string over here, and I am storing it in an object named cricket. Now, what I'll do is pass this object into the word_tokenize function, store the result in cricket_tokens, and print it out. This is the part where we are basically tokenizing this entire string into individual tokens with the help of this word_tokenize function. I'll click on run. Right, so we see that this entire string is split up into individual tokens: "After" as one token, "surprising" as a token, "the" as a token, "hosts" as a token.

So again, we've taken this entire string and split it up into individual tokens. Now, after doing that, let's also have a glance at the type of this and the number of tokens we have. I'll click on run, and we see that this is a list of all the tokens and we have 201 tokens in total. Now, let's also get the frequency of these tokens, that is, I want to find out the number of times each token value occurs. For that, I would require FreqDist from the nltk.probability package, so I'll import this and create an instance of it with the name fdist. Again, I'll click on run. Now, I will create a for loop where this i value will iterate through all of the tokens, and for each token I will keep adding to the count wherever it is encountered, and finally I will print out fdist. So, I get the frequency of all the tokens over here. We see that "the" occurs 11 times, "," occurs 8 times, "a" occurs 6 times, "for" occurs 6 times, and "South" and "Africa" occur 5 times each, right?

So, this basically is the frequency distribution of all the tokens. Now, if I want to find out the 10 most common tokens, then I have the most_common function, so I'll call fdist.most_common and pass in the parameter 10. This will give me the 10 most common tokens, and I'll store them in top_10 and print them out. I click on run. So, these are the 10 most common tokens: "the" is the most frequent token, occurring 11 times.
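Roughly, the tokenization and frequency-distribution steps described above could look like the sketch below. The string here is a shortened stand-in for the longer paragraph used in the session, so the counts will differ from the ones quoted.

```python
import nltk
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist

# nltk.download('punkt')  # uncomment on the first run to fetch the tokenizer models

# Shortened stand-in for the longer cricket paragraph used in the session
cricket = ("After surprising the hosts in the first test, "
           "Sri Lanka made a positive start to the second test.")

cricket_tokens = word_tokenize(cricket)      # split the string into individual tokens
print(cricket_tokens)
print(type(cricket_tokens), len(cricket_tokens))

fdist = FreqDist()
for i in cricket_tokens:
    fdist[i] += 1                            # count how often each token occurs
print(fdist)

top_10 = fdist.most_common(10)               # the 10 most frequent tokens
print(top_10)
```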
Now, human beings can understand linguistic structures and their meanings easily, but machines are not successful enough at natural language comprehension yet.

So, when a machine has to parse through one word at a time, it may not be able to fully understand the semantics of a sentence, and these words, when given one at a time, are known as unigrams. Let's take this sentence over here, "I love natural language". When each of these tokens is given one at a time to the machine, this is nothing but a unigram: "I", "love", "natural", "language". So, these are basically unigrams given to the machine. Similarly, when the machine is given two words at a time, it is known as a bigram.

So, this is a bigram: "I love", "love natural", "natural language". This entire sentence is basically being given in the form of two tokens at a time. Similarly, when this entire sentence is given three words at a time, it is known as a trigram. This time, it is split like this: "I love natural" is one part and then we have "love natural language", right? So, this is the trigram which we have over here. Again, let's head back to the Jupyter Notebook and learn how to create bigrams and trigrams. Here we have the sentence, "Did you know there was a tower, where they look out to the land, to see people quickly passing by", and I store this in black_smoke. I will just quickly hit run. Now, I will tokenize this sentence with the help of the word_tokenize function, store the result in black_smoke_token, and then print it out. I'll click run.

So, these are the individual tokens from this entire sentence which we have over here. Now, if I want to create bigrams and trigrams from this, we have the nltk.bigrams and nltk.trigrams functions. I will use this nltk.bigrams function and pass these individual tokens inside it. So, I am passing this black_smoke_token object inside the nltk.bigrams function, and I want to see the result in the form of a list. I click on run, right?

So, now these individual tokens are being represented in the form of bigrams: "did you", "you know", "know there", "there was". Similarly, if I want to represent these tokens in the form of trigrams, I will use the nltk.trigrams function and pass in this object, black_smoke_token. Again, I'll hit run, and now we have trigrams: "did you know", "you know there", "know there was", right? And finally, we have something known as n-grams. n-grams take an additional parameter over here, so if I want four words together I'll put in four, if I want five words together I'll put in five, and similarly if I want ten words together I'll just put in ten over here. For that, I'll have to use the nltk.ngrams function: the first parameter would be the list of tokens, and after that comes the parameter which tells us the number of words to take together. This time I want to see four words together, so I'll put in four and then click on run, right?

So over here, we have four words together.
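A minimal sketch of these n-gram steps, assuming the same sentence and tokens, could look like this:

```python
import nltk
from nltk.tokenize import word_tokenize

black_smoke = ("Did you know there was a tower, where they look out to the land, "
               "to see people quickly passing by")
black_smoke_token = word_tokenize(black_smoke)

bigrams = list(nltk.bigrams(black_smoke_token))      # two tokens at a time
trigrams = list(nltk.trigrams(black_smoke_token))    # three tokens at a time
fourgrams = list(nltk.ngrams(black_smoke_token, 4))  # n-grams with n = 4

print(bigrams[:4])
print(trigrams[:3])
print(fourgrams[:3])
```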
So now, let's understand stemming. Stemming is the process of reducing a word to its base form, and this is done by cutting off the beginning or ending of the word. This indiscriminate cutting will be successful on some occasions and fail on others, so let's take these three examples over here. The first is "studies": we cut off the suffix "es" and it becomes "studi". In the second case, "giving", we cut off "ing" and it becomes "giv", and in the third case we cut off "ing" from "buying" and it becomes "buy". So, only in the third case do we get a word which makes sense. When implementing stemming, it is not always necessary that the final stem, the word which we get, has a meaning associated with it. Now, there are many stemming algorithms available, and one such stemming algorithm is the Porter Stemmer. We'll be importing this PorterStemmer from the nltk.stem package. After that, I will create an instance of this Porter Stemmer with the name pst. I want to stem these three words over here, winning, studies, and buying, so I will pass these three words into pst.stem. So, pst is the Porter Stemmer instance, and this instance has the stem function, so I want to stem the word winning, the word studies, and the word buying. I'll hit enter. So, I just have to run this first, and now I click on run again. Right, so we see that winning has been stemmed to win, studies has been stemmed to studi, and buying has been stemmed to buy.
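In code, the stemming step described above could look roughly like this:

```python
from nltk.stem import PorterStemmer

pst = PorterStemmer()
for word in ["winning", "studies", "buying"]:
    print(word, "->", pst.stem(word))   # winning -> win, studies -> studi, buying -> buy
```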
Now, similar to stemming, we also have lemmatization. Lemmatization is the process of reducing words to their lemma, or dictionary form, so the base form into which the word is converted should definitely have a meaning associated with it. Again, we have three cases over here: "studies" is converted to "study", which obviously has a meaning, "giving" has been converted to "give", and similarly "buying" has been converted to "buy". So, we see that all three of these words have a meaning associated with them. Now, to implement lemmatization with the help of the NLTK package, you would need wordnet and the WordNetLemmatizer, so we will be importing these two from the nltk.stem package, and I will create an instance of this WordNetLemmatizer. Now, what I want is the lemma forms of these three words, cats, cacti, and geese, and I will store them in the object words_to_stem. After that, I will create a for loop which will go through these three words and lemmatize each of them. I'll hit run. Right, so we see that cats has been lemmatized to cat, the lemmatized version of cacti is cactus, and finally the lemmatized version of geese is goose.
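A short sketch of the lemmatization step, as described above:

```python
import nltk
from nltk.stem import WordNetLemmatizer

# nltk.download('wordnet')  # uncomment on the first run

word_lem = WordNetLemmatizer()
words_to_stem = ["cats", "cacti", "geese"]
for word in words_to_stem:
    print(word, "->", word_lem.lemmatize(word))   # cats -> cat, cacti -> cactus, geese -> goose
```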
Now, another important concept of natural language processing is parts of speech tagging. In the English language, words can be considered the smallest elements that have distinctive meanings, and based on their use and functions, words are categorized into different classes which are known as parts of speech. So, POS tagging is the process of marking or tagging every word in a sentence with its corresponding part of speech. Let's take the sentence "Joey loves pizza". We can mark each of these words with a POS tag: the word "Joey" is a noun, "loves" is a verb, and similarly "pizza" is a noun again. Next, we'll be adding POS tags to this sentence over here, "What do you mean I don't believe in God? I talk to him every day", and I've stored this in the object peace. Now again, the first task would be to tokenize this sentence, so I will use the word_tokenize function, pass in this object, and store the result in peace_tokens. Let me execute these two. Now, I will start a for loop which will iterate through all of the tokens, and for each token I will add a POS tag with the help of the pos_tag function. So, I will use nltk.pos_tag and add a POS tag to each of the individual tokens from this peace_tokens list. I'll click on run. Right, so we see that each of these individual tokens has been marked with a POS tag: "What" has a POS tag of wh-pronoun, "do" has a POS tag of verb, "you" has a POS tag of personal pronoun, and "mean" has a POS tag of noun, and similarly we have POS tags for the rest of the tokens.
Now, let's also add POS tags for another sentence over here. The sentence is "Mary had a little lamb whom she really loved", and we are storing this in the object mary. Again, I'll use the word_tokenize function, pass in this object, and store the result in mary_tokens. After that, I'll use a for loop which will iterate through all of the tokens and add a POS tag to each of them. I'll hit run. Right, so "Mary" is a proper noun, "had" is a verb, "a" is a determiner, and similarly we have other POS tags associated with each of the words.
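A minimal sketch of the POS-tagging steps above. The walkthrough loops over the tokens and tags them one at a time; here the whole token list is passed to nltk.pos_tag in one call, which is the more common usage and gives the tags in one go.

```python
import nltk
from nltk.tokenize import word_tokenize

# nltk.download('averaged_perceptron_tagger')  # uncomment on the first run

peace = "What do you mean I don't believe in God? I talk to him every day."
peace_tokens = word_tokenize(peace)
print(nltk.pos_tag(peace_tokens))    # each token paired with its POS tag

mary = "Mary had a little lamb whom she really loved"
mary_tokens = word_tokenize(mary)
print(nltk.pos_tag(mary_tokens))
```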
Moving ahead, we'll understand what named entity recognition is. Named entity recognition is the process of taking a string of text as input and identifying relevant nouns, such as people, places, and organizations, that are mentioned in that string. So, in the sentence "Apple is looking to buy UK startup for 1 billion dollars", Apple has been recognized as an organization, UK has been recognized as a geopolitical entity, and 1 billion dollars has been recognized as money. One thing to be noted over here is that all three of these are nouns, so named entity recognition is done only on nouns. To do named entity recognition, we would require the ne_chunk function, so we'll import this from the NLTK package. I'll hit run. Now, we will be doing named entity recognition for this sentence over here, "John lives in New York", and I'll store this in john. Again, the first task is to tokenize this sentence, so I'll use the word_tokenize function, pass in this object john, and store the result in john_token. I'll click on run. Right, so we have tokenized this. After that, we will add a POS tag to each of these tokens, so I'll use the function nltk.pos_tag, pass in these tokens, store the result in john_tags, and print it out. Right, so we have a POS tag associated with each of these tokens: "John" is a noun, "lives" is a verb, and "in" is a preposition. Now, finally, it's time to get the named entity recognition from these POS tags, so I will use the ne_chunk function, pass in this john_tags object, store it in john_ner, and print it out. Right, so we see that John has been recognized as a person, and similarly New York has been recognized as a geopolitical entity. Again, the thing to be kept in mind is that named entity recognition is done only on nouns: John is a noun and New York is a noun, and that is why John has been recognized as a person and New York has been recognized as a geopolitical entity.
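Put together, the NER steps above could look roughly like this:

```python
import nltk
from nltk import ne_chunk
from nltk.tokenize import word_tokenize

# nltk.download('maxent_ne_chunker'); nltk.download('words')  # uncomment on the first run

john = "John lives in New York"
john_token = word_tokenize(john)        # tokenize the sentence
john_tags = nltk.pos_tag(john_token)    # add a POS tag to each token
john_ner = ne_chunk(john_tags)          # named entity recognition on the tagged tokens
print(john_ner)
```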
So, all of these tasks were done with NLTK. Now, we also have spaCy, which is a relatively new framework in the Python natural language processing environment, but it is quickly gaining ground and will most likely become the de facto library for natural language processing. spaCy is written in Cython, which is a C extension of Python that gives C-like performance to the Python program. So, let's go ahead and import the spaCy package. Now, the spaCy package has different models, so we'll load the en_core_web_sm model and store it in an nlp object. I'll hit run. Right, now using this nlp object I can create different documents. These documents basically contain strings, so each string can be classified as a document. I will store "This is pasta!!" inside this document and click run. Now, tokenization is very simple when it comes to the spaCy package: all I have to do is start a for loop and print all of the tokens present in that document. So, for token in doc, it will iterate through all of the tokens present in the doc and print the text of each token. I'll hit run. Right, so these are the five tokens which are present in the string: "This", "is", "pasta", and then we have two exclamation marks.
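A minimal spaCy tokenization sketch matching the steps above (the model can be installed with `python -m spacy download en_core_web_sm`):

```python
import spacy

nlp = spacy.load("en_core_web_sm")   # load the small English model
doc = nlp("This is pasta!!")

for token in doc:
    print(token.text)                # This, is, pasta, !, !
```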
Now, after that, let's say I want to get the third token from this. The third token is present at index number two of this document (zero, one, two), so I'll pass in doc, give it the index number two, and store this in token. Right, so the third token is "pasta". Similarly, if I want all the tokens present at the second, third, and fourth indexes, then inside the brackets of doc I will start at two and end at five, so this will give me all the tokens starting at the second index and ending at the fourth index. Now, let me go ahead and print the token index and the token text together. So, this is the document "This is pasta!!". In this print command, I am first printing the token index, which I get with the help of token.i, and after that I am also printing the token's text, which I get with the help of token.text. Right, so I have the token index and the corresponding token text for each of these index values.
Now, similarly, let me also do POS tagging with the spaCy package. For that, the document has "Joey doesn't share pizza", and I will iterate through all of the tokens in this document. First I will print the index of the token, after that I will print the text of the token, and finally I will print the POS tag of the token. Right, so we have the index value, we have the token text, and we also have the POS tag corresponding to each of these tokens.
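The indexing, slicing, and POS-tagging steps above could be sketched like this (the exclamation marks on the Joey sentence are assumed, matching the earlier tokenization example):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp("This is pasta!!")
token = doc[2]          # the third token, at index 2
print(token)            # pasta
print(doc[2:5])         # tokens at indexes 2, 3 and 4

doc2 = nlp("Joey doesn't share pizza!!")
for token in doc2:
    # token.i is the index, token.text the raw text, token.pos_ the POS tag
    print(token.i, token.text, token.pos_)
```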
Similarly, we can also do named entity recognition. For this, we have the string "Apple is looking to buy UK startup for 1 billion dollars", and obviously I have stored it in the doc object. I click on run, and over here we have this for loop: for ent in doc.ents, we will print out the entity's text and also the corresponding entity label. So basically, doc.ents comprises all the entities and the corresponding labels for the entities recognized. I'll hit run. Right, so doc.ents has recognized these three entities: Apple has been recognized as an organization, UK has been recognized as a geopolitical entity, and 1 billion dollars has been recognized as money. Next we have "Barack Obama, the former president of the United States, will be vacating the White House today". I'll click on run, again I'll start the for loop, and I will print the entity and the corresponding entity label. Right, so Barack Obama has been recognized as a person, the United States is recognized as GPE, and today has been recognized as a date.
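Roughly, the spaCy NER steps above look like this:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp("Apple is looking to buy UK startup for 1 billion dollars")
for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. Apple ORG, UK GPE, 1 billion dollars MONEY

doc2 = nlp("Barack Obama, the former president of the United States, "
           "will be vacating the White House today")
for ent in doc2.ents:
    print(ent.text, ent.label_)   # e.g. Barack Obama PERSON, the United States GPE, today DATE
```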
Finally, we also have something known as the Matcher from spaCy. The Matcher helps us to find patterns in a string with different criteria, such as matching on the lemma of a word, on whether the word is a digit or not, on whether the word is a punctuation mark, and so on. So, let's do some pattern matching on the string over here: "Barack Obama, the former president of the United States, will be vacating the White House today". Again, I will store this in doc. Now, first I will create a pattern. A pattern is basically created as a list of dictionaries, where each dictionary indicates a word. For the first word, we are trying to match on a lemma value, so the first word's lemma should be "vacate", and after that it needs to be followed by another word, which is "White". So, I want to extract a substring from this entire doc where the first word is a lemma variation of "vacate" and the second word is "White". Now, I will create an instance of Matcher by passing in nlp.vocab. This matcher instance has an add function which takes in three parameters: the first parameter is the ID, which helps us to uniquely identify this pattern, the second parameter is the callback function, which I have set to None over here, and the third parameter is the pattern which I want to identify in the string. I've given these three parameters, and after that I will pass the document into the matcher and store the result in matches. I'll hit run. Now, this matches object comprises three things: the match ID, the start index, and the end index, and I will put these three inside a for loop and get the substring from the document. I'll hit run. Right, so I have successfully extracted "vacating White" from this entire string over here. So, the pattern basically was that the first word needs to be a lemma variation of "vacate" and it needs to be followed by the word "White".
Now, similarly, let's also do pattern matching on this string over here: "2018 FIFA World Cup: France won!". I'll click run. This time, the pattern should be something like this: the first word should be a digit, the second word should be "FIFA", the third word should be "World", and the fourth word should be "Cup". Again, I will create an instance of this Matcher and pass in nlp.vocab. After that, I will use the add function and give these three parameters: the ID is "FIFA_pattern", the second parameter is None, and the third parameter is the pattern which I want to extract from the string. After that, I will use this matcher, pass in the document, and store the result in matches2. I'll click run. Again, I'll use this for loop and extract the substring from the original string. I'll click on run, and I have successfully extracted the substring "2018 FIFA World Cup" from this original string over here.
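A sketch of both Matcher patterns follows. Note that the three-argument matcher.add(id, None, pattern) call described in the walkthrough is the older spaCy 2.x API; in spaCy 3.x the call is matcher.add(id, [pattern]), which is what is shown here.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")

doc = nlp("Barack Obama, the former president of the United States, "
          "will be vacating the White House today")

# First pattern: a token whose lemma is "vacate", followed by the word "White"
pattern = [{"LEMMA": "vacate"}, {"LOWER": "white"}]
matcher = Matcher(nlp.vocab)
matcher.add("vacate_pattern", [pattern])   # spaCy 2.x: matcher.add("vacate_pattern", None, pattern)

for match_id, start, end in matcher(doc):
    print(doc[start:end])                  # -> vacating White

# Second pattern: a digit, then FIFA, World, Cup
doc2 = nlp("2018 FIFA World Cup: France won!")
pattern2 = [{"IS_DIGIT": True}, {"LOWER": "fifa"},
            {"LOWER": "world"}, {"LOWER": "cup"}]

matcher2 = Matcher(nlp.vocab)
matcher2.add("FIFA_pattern", [pattern2])
for match_id, start, end in matcher2(doc2):
    print(doc2[start:end])                 # -> 2018 FIFA World Cup
```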
Right, so this is how we can implement NLP concepts with the help of the NLTK package and the spaCy package. Now, we will see how to do sentiment analysis using the NLTK package. We'll start off by loading all of the required packages: I'll import nltk, I'll import random, and we'll also be doing a bit of classification, so for that purpose we would require all of these classifiers. We'll need the SklearnClassifier, and we'll be implementing the Naive Bayes classifier, the SGD classifier, logistic regression, and SVC. So, I'll be importing all of these: MultinomialNB and BernoulliNB from sklearn.naive_bayes, LogisticRegression and SGDClassifier from sklearn.linear_model, and, since we'll also be implementing the support vector classifier, SVC, LinearSVC, and NuSVC from sklearn.svm. We would also have to do a bit of pre-processing of the data which we have, so for that purpose we would import word_tokenize from nltk.tokenize, and to load the files we would require the os package, so I'll also import that. I'll click on run, and we have imported all of the packages that we need. Now, it's time to load the dataset.
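The import block described above would look roughly like this:

```python
import os
import random
import nltk
from nltk.tokenize import word_tokenize
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
```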
We'll be performing sentiment analysis on top of the IMDB reviews dataset. This consists of 25,000 reviews, and of these 25,000 reviews, 12,500 are positive reviews and the remaining 12,500 are negative reviews. These two sets are stored in separate folders: there is one folder called pos, which contains all of the positive reviews, and there is another folder called neg, which contains all of the negative reviews. All right, so first I will load all of the positive reviews into files_pos. These are the two lines over here: what I am doing is reading every single review as a single line, so for every positive review, I'll read it and store it into files_pos. Right, so files_pos will have all of the positive reviews. Similarly, I'll do the same thing for the negative reviews: all of my negative reviews are stored in this neg folder, so this is the entire path on the D drive, and the folder which has the negative reviews is neg. I will keep reading every single review and store it in files_neg. I'll click on run; this is a huge dataset which comprises a lot of files. So now I have my positive reviews and my negative reviews; let me find out the length of both of these. As I told you guys, there are 12,500 positive reviews which are stored in files_pos, and similarly there are 12,500 negative reviews which are stored in files_neg. Right, so now we have the positive reviews and negative reviews with us. Now, if we want to implement all of the classifier algorithms and do processing on top of all of these files, it would consume a lot of time, right? So, what I'll do is just take a subset of this: from the 12,500 positive reviews I'll select only the first thousand positive reviews, and similarly from the 12,500 negative reviews I'll select only the first thousand negative reviews, and I'll store them back into the same datasets. So now files_pos has the first thousand positive reviews and files_neg has the first thousand negative reviews. I'll click on run. Right, so this is done. Just to verify, let me again have a glance at the lengths of files_pos and files_neg, and we see that there are a thousand and a thousand. Right, so there are a thousand positive reviews and a thousand negative reviews.
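A rough sketch of the loading step, with hypothetical local folder paths standing in for the exact paths used in the session:

```python
import os

# Hypothetical local paths; the session keeps the IMDB reviews in pos/ and neg/ folders
pos_path = r"D:\imdb_reviews\pos"
neg_path = r"D:\imdb_reviews\neg"

files_pos = [open(os.path.join(pos_path, f), encoding="utf-8").read()
             for f in os.listdir(pos_path)]
files_neg = [open(os.path.join(neg_path, f), encoding="utf-8").read()
             for f in os.listdir(neg_path)]

print(len(files_pos), len(files_neg))   # 12500 12500 for the full dataset

# Keep only the first 1,000 reviews from each class to speed things up
files_pos = files_pos[:1000]
files_neg = files_neg[:1000]
print(len(files_pos), len(files_neg))   # 1000 1000
```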
Now, it's time to pre-process all of these reviews. I'll start by creating empty lists: the first empty list is all_words and the second empty list is documents, and I'll import stopwords from nltk.corpus. My first task in pre-processing would be to remove all of the stop words, so I'll use set(stopwords.words('english')) over here to get the list of all of the stop words in English, and I will store all of those stop words in this object stop_words. Right, now this is done, and what I actually want next is a bag of words, or rather a bag of adjectives, because adjectives are a better way to understand the sentiment of a review. Let's say there are some adjectives such as "great", "good", or "bad": these adjectives are the main words which tell us about the sentiment of a review, and that is why I want only the adjectives from the entire review. Now, during POS tagging, each part of speech has a different letter assigned to it: adjectives are denoted with J, adverbs are denoted with R, and verbs are denoted with V. Since I want only the adjectives, I will set over here that allowed_word_types is only "J", or in other words, I want only the adjectives from all of the parts of speech.
Right, so this is the initial pre-processing. Now, I'll start this for loop, and it goes through all of the reviews in files_pos. So, "for p in files_pos" starts with the first review and goes on through the thousand positive reviews. I have this empty documents list over here, and using this for loop I'll create a list of tuples, where the first element of each tuple is a review and the second element is the label of the review. So, p comes from "for p in files_pos": when we enter the for loop, it holds the first review, and over here "pos" means that this review is positive. Again, this will iterate, and then we will have the second review, which is also positive, then the third review with the positive label again, and so on inside this loop. Now, once we create this list of tuples, we also have to remove the punctuation present in the reviews, so I'll pass in p over here, and this is the syntax to remove all of the punctuation. Now that I've removed the punctuation, it's time to also tokenize my words. Tokenizing words is very important: each individual word, or each individual element present in all of these reviews, will be tokenized, and for that purpose I have to use the word_tokenize function and pass in this cleaned object, which is the object that does not have any punctuation. I'll pass it inside word_tokenize, and once it is tokenized I will store the result in tokenized. Right, so now we have the tokenized result, and then we have to remove all of the stop words from all of these tokens. Let's understand this over here: "w for w in tokenized if not w in stop_words" means that if the token is not present in stop_words, then go ahead and store it in stopped, and this loop will continue and we'll get only those tokens which are not present in stop_words. Right, so we have successfully removed all of the stop words, and I will pass this object into the nltk.pos_tag function so that I can do the POS tagging of all of the tokens. So, I have all of my tokens with me, the parts of speech tagging is done for all of the individual tokens, and I'll store the result into pos, and this is where I have all of the tokens associated with a parts of speech tag. Now, as I told you earlier, from all of these parts of speech I want only the adjectives; I do not want all of the parts of speech. So, over here inside this loop, I will select only those tokens whose tags are present in allowed_word_types, or in other words, I am extracting only those tokens which are adjectives, and I will store all of them in all_words. Right, so now all_words has all of the positive adjectives present in it.
Right, so we have stored all of the positive adjectives into all_words, and we have to do the same thing for all of the negative reviews. This time I'll start the loop for all the negative reviews, so this will go like "for p in files_neg", and over here I am again creating the list of tuples, where the first element of each tuple is the review, but this time it is a negative review. So, this is the negative review which is present in files_neg, and I will also associate a label with it, which is "neg" this time, since over here this is a negative review: I have the review over here and the label is "neg", which tells us that this is a negative review. Now that this is done, I also have to remove all of the punctuation present inside this negative review, so I'll just type in re.sub, and this is how I can remove the punctuation. Then I will go ahead and tokenize the words: I'll pass this cleaned object into word_tokenize and store all of the tokens in tokenized. Now, it's time to remove the stop words: "w for w in tokenized if not w in stop_words" means that this loop will go through all of the tokens, and only if the token is not present in stop_words will that token be stored in the stopped object, or in other words, I want only those tokens which are not present in the stop_words object. So, I have removed the stop words, and now it's time for parts of speech tagging, so I will pass this stopped object into nltk.pos_tag, and this helps me to do parts of speech tagging on all of the tokens. Now, for sentiment analysis I require only the adjectives and not all the parts of speech, so again I will start this for loop: it will go through all the tokens present in neg, extract only the negative adjectives, and append them to all_words. So, initially, in the above for loop we had all of the positive adjectives, and with the help of the below for loop we have all of the negative adjectives appended to all_words. So now, this all_words is a combination of all of the positive adjectives and all of the negative adjectives. Right, so now we have our bag of words, or bag of adjectives, with us.
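The walkthrough writes two separate loops, one for the positive reviews and one for the negative reviews; the sketch below folds them into a single loop over both labels, but the individual steps (label, strip punctuation, tokenize, drop stop words, POS-tag, keep adjectives) are the ones described above.

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# nltk.download('stopwords')  # uncomment on the first run

stop_words = list(set(stopwords.words("english")))

all_words = []              # will collect every adjective from every review
documents = []              # will collect (review, label) tuples
allowed_word_types = ["J"]  # J* tags are adjectives in the Penn Treebank tag set

for label, reviews in [("pos", files_pos), ("neg", files_neg)]:
    for p in reviews:
        documents.append((p, label))                # keep the raw review with its label
        cleaned = re.sub(r"[^a-zA-Z\s]", "", p)     # strip punctuation and digits
        tokenized = word_tokenize(cleaned)
        stopped = [w for w in tokenized if w not in stop_words]
        tags = nltk.pos_tag(stopped)
        for word, tag in tags:
            if tag[0] in allowed_word_types:        # keep only adjectives (JJ, JJR, JJS)
                all_words.append(word.lower())
```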
Now, I'll go ahead and create a frequency distribution for all of these adjectives, and for that purpose I have to use nltk.FreqDist and pass in the all_words object. I'll click on run, and this is the frequency distribution: we see that the adjective "good" occurs 1,157 times, "bad" occurs 743 times, "great" occurs 660 times, "much" occurs 482 times, and so on. Now, let me also visualize this graphically, so I'll import matplotlib and just make a simple line plot. This is our line plot, and we see that "good" is the adjective which occurs the most number of times over here, around 1,157 times. Now, out of all of these adjectives, I want the thousand most frequently occurring adjectives, so I'll just extract them: I'll type in all_words.keys(), extract the first thousand adjectives, and store them into word_features. Right, so these are the thousand most frequently occurring adjectives over here.
Now, we will actually have to create a tuple where the first element is a dictionary and the second element is the label, positive or negative. To create the dictionary of features, we'll use a user-defined function. What we'll do is create a dictionary of features for each review present in the documents list. In this dictionary, the keys are all of the words in word_features, and the value of each key is either True or False, which tells us whether that feature appears in the review or not. It looks something like this: we have a tuple, the first element of the tuple is a dictionary, and in that dictionary the keys comprise all of the adjectives and the value is True or False, which tells us if the adjective is present in the review or not. Then, the second element of the tuple is "pos" or "neg", which tells us whether the review is positive or negative, and we are doing this because we want to find out whether this particular review is positive or negative. This user-defined function which you see helps us to do exactly that. After that, we will create the feature sets. A feature set is nothing but our tuple where the first element is a dictionary and the second element is the label, which is "pos" or "neg". Then, I will go ahead and shuffle these feature sets, and I will divide them into training and testing data: the training data will comprise 800 reviews and the testing data will comprise 200 reviews, so the first 800 feature sets go into the training set and the last 200 feature sets go into the testing set. I'll click on run.
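A sketch of the feature-set construction. The walkthrough takes list(all_words.keys())[:1000]; most_common(1000) is used here instead, since it is the call that actually guarantees the 1,000 most frequent adjectives. The 800/200 split follows the numbers quoted in the session.

```python
import random
from nltk import FreqDist
from nltk.tokenize import word_tokenize

all_words_dist = FreqDist(all_words)
# The 1,000 most frequent adjectives become the feature vocabulary
word_features = [w for w, count in all_words_dist.most_common(1000)]

def find_features(document):
    """Return an {adjective: True/False} dict saying whether each frequent
    adjective from word_features appears in the given review."""
    words = set(word_tokenize(document))
    return {w: (w in words) for w in word_features}

feature_sets = [(find_features(review), label) for (review, label) in documents]
random.shuffle(feature_sets)

training_set = feature_sets[:800]      # 800 reviews for training, as in the walkthrough
testing_set = feature_sets[800:1000]   # 200 reviews for testing
```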
Now, let me have a glance at the first feature set. Right, so this is our dictionary over here, which is the first element. What we see is this: this is an adjective, and this False tells us that this adjective is not present in the review. Let's actually look for a True value. Right, there's a True, and it is "br", so this tells us that "br" is present in the review. Let's find some other True value: "better" is present in this review. Let me also find another True value further down; right, so over here we have "mediocre", so "mediocre" is present in this review. So, wherever we have False, it means that those adjectives are not present in the review, and wherever we have True, it means that those adjectives are present in the review. Now, let me scroll down and see what the label is. This is "pos", right? So, all of the adjectives which are present in the review, when combined, give us the positive label, and this is the training set. Now, what I want is to build a couple of classifiers on top of this training set, and once the learning is done, I will try to predict the values on top of the test set and see whether the prediction is correct or wrong. So, the training will be done on this training set; it learns all of these features, it learns the labels, and once that is done, I'll predict the values on top of the test set.
is done I'll flick the values on top of the test set so I'll start off by
implementing the naive Bayes classifier so I'll just type in NLT KDOT naive
Bayes classifier and I will train it on top of the training set and I will check
for its accuracy on top of the test set and I'll just print it out and after
that I would also want to show the 15 most informative features from this
classifier so I'll click on run so we see that we get an accuracy of seventy
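In code, that step is roughly:

```python
import nltk

classifier = nltk.NaiveBayesClassifier.train(training_set)
print("Naive Bayes accuracy:", nltk.classify.accuracy(classifier, testing_set) * 100)
classifier.show_most_informative_features(15)
```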
So, we see that we get an accuracy of 75.5 percent when we build this model on top of the training set, and these are the most informative features: powerful, serial, cheap, superb, and so on. These are the 15 most informative features for this classifier, and on the right side, what you see over here is the ratio of positive to negative, which is 11.3 to 1.0 for the first adjective. This means that the word "powerful" is 11.3 times more likely to occur in a positive review than in a negative review. For "serial", the ratio is negative to positive, which means that the adjective "serial" is 8.1 times more likely to occur in a negative review than in a positive review. Again, we have "cheap": this word is 6.5 times more likely to occur in a negative review than in a positive review. And again, let's take "superb": this means that the word "superb" is 4.9 times more likely to occur in a positive review than in a negative review. So, we have implemented the Naive Bayes classifier on top of the training set, and we've got an accuracy of 75.5 percent.
Now, we'll go ahead and implement some other classifiers. We'll start with Multinomial Naive Bayes, then we'll implement Bernoulli Naive Bayes, then logistic regression, after that the SGD classifier, which is basically stochastic gradient descent, and finally we will also implement SVC, which is the support vector classifier. So, let me import all of these packages first. Right, so we have imported all of these packages. Now, we'll implement the first classifier, which is Multinomial Naive Bayes. I'll just create an instance of this, store the instance in mnb_clf, train this classifier on top of the training set, and store the trained classifier in mnb_cls. Now, since I've built the model on top of the training set, I can go ahead and predict the values on top of the test set, so again I'll just type in nltk.classify.accuracy. I want the accuracy, where my trained classifier is stored in mnb_cls and my actual values are in the testing set. Let me print this out. So, we have built the Multinomial Naive Bayes classifier and we've got an accuracy of 77.5 percent.
Now, we'll build the Bernoulli Naive Bayes classifier. I'll just create an instance of BernoulliNB, store it in bnb_clf, fit this model on top of the training set, and store the result in bnb_cls. Again, since I fit the model on top of the training set, I can go ahead and predict the values on top of the test set, so I'll use nltk.classify.accuracy, and over here this takes in two parameters: the first parameter is the trained classifier, which is bnb_cls, and the second is my testing set, which comprises the actual values. Let me click on run, and this time I get an accuracy of 75 percent with Bernoulli Naive Bayes. Now, I'll implement the logistic regression algorithm: I create an instance of it, name that instance logreg_clf, fit the model on top of the training set again, and store the result in logreg_cls. Similarly, I'll get the accuracy by evaluating on top of the test set, and with the logistic regression algorithm I get an accuracy of 72 percent.
Now, we've got two more classifiers left. This time we are going to implement the SGD classifier, which is the stochastic gradient descent classifier. I'll name its instance sgd_clf, fit the model on top of the training set, store it in sgd_cls, and predict the values on top of the test set. I will click on run. Right, so with the stochastic gradient descent classifier we get an accuracy of 68 percent. You see that the accuracy keeps on decreasing over here: we started with Multinomial Naive Bayes, which gave us an accuracy of 77.5 percent, then Bernoulli Naive Bayes gave us an accuracy of 75 percent, logistic regression gave an accuracy of 72 percent, and SGD gave an accuracy of 68 percent. Now, let's see what accuracy the SVC classifier gives. I'll click on run, and the support vector classifier gives a very, very low accuracy, which is just 49.5 percent. So now, if I compare all of these accuracies, I see that I get the highest accuracy with the Multinomial Naive Bayes classifier, and the accuracy is 77.5 percent, so we can basically choose this model over all the other models when we are trying to ascertain the sentiment of these reviews. Right, so this is how we can do sentiment analysis using the NLTK package and the scikit-learn package.
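The walkthrough creates a separately named instance for each classifier (mnb_clf, bnb_clf, logreg_clf, sgd_clf, and so on); the sketch below folds that into one loop, but the mechanism is the same: each scikit-learn estimator is wrapped in NLTK's SklearnClassifier, trained on the same training_set, and scored with nltk.classify.accuracy on the testing_set.

```python
import nltk
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC

# Wrap each scikit-learn estimator so it can be trained on NLTK feature sets
estimators = [("MultinomialNB", MultinomialNB()),
              ("BernoulliNB", BernoulliNB()),
              ("LogisticRegression", LogisticRegression()),
              ("SGDClassifier", SGDClassifier()),
              ("SVC", SVC())]

for name, estimator in estimators:
    clf = SklearnClassifier(estimator)
    clf.train(training_set)
    accuracy = nltk.classify.accuracy(clf, testing_set) * 100
    print(name, "accuracy:", accuracy)
```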
Now, let's head to the quiz. We have this sentence, "This is pasta". Now, what do you think, how many bigrams can be generated from this sentence? These are the options; think about it and write down the answer in the comment section. So guys, this brings us to the end of the session. Do subscribe to the Intellipaat YouTube channel for more such content.
