(siraj) Hello, world!
It's Siraj, and
we're going to make an app that reads
an article of text and creates a one
sentence summary out of it using the power of
natural language processing. Language is in many ways
the seat of intelligence. It's the original
communication protocol that we invented to
describe all the incredibly complex processes
happening in our neocortex. Do you ever feel
like you're getting flooded with an increasing
amount of articles and links and videos to choose from?
As this data grows, the
importance of semantic density does as well.
How can you say the
most important things in the shortest amount of time?
Having a generated
summary lets you decide whether you want to
deep dive further or not. And the better it
gets, the more we'll be able to apply it to
more complex language, like that in a scientific
paper or even an entire book. The future of NLP is
a very bright one. Interestingly enough, one of the
earliest use cases for machine summarization was by
the Canadian government in the early 90s
for a weather system they invented called FoG.
Instead of sifting through
all the meteorological data they had access
to manually, they let FoG read it and generate
a weather forecast from it on a recurring basis.
It had a set textual
template and it would fill in the values
for the current weather given the data,
something like this. It was just an
experiment, but they found that sometimes
people actually prefer the computer generated
forecasts to the human ones, partly because the
generated ones use more consistent terminology.
A similar approach has
been applied in fields with lots of data that
needs human readable summaries, like finance.
And in medicine, summarizing
a patient's medical data has proven to be a
great decision support tool for doctors.
Most summarization tools in
the past were extractive: they selected an existing
subset of words or numbers from some data to
create a summary. But you and I do something a
little more complex than that. When we summarize,
our brain builds an internal semantic
representation of what we've just
read and from that, we can generate a summary.
This is instead an
abstractive method and we can do this
with deep learning. What can't we do with it?
So let's build a
text summarizer that can generate a headline from
a short article using Keras. We're going to use this
collection of news articles as our training data.
We'll convert it
to pickle format, which essentially
means converting it into a raw bytestream.
Pickling is a way of
converting a Python object into a byte stream
so we can easily reconstruct
that object in another Python script.
Modularity for the win.
We're saving the data as a tuple
with the heading, description, and keywords.
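Here is a minimal sketch of that pickling step. The two sample articles and the variable names are made up for illustration and are not the dataset from the video. It serializes to an in-memory byte string rather than a file so it stays self-contained.

```python
import pickle

# Hypothetical sample data: parallel lists, one entry per article.
heads = ["Rain expected Tuesday", "Markets close higher"]
descs = [
    "Forecasters expect heavy rain across the region on Tuesday.",
    "Stocks rose on Monday as investors welcomed strong earnings.",
]
keywords = None  # tags are not used in this example

# Serialize the tuple to a byte stream...
payload = pickle.dumps((heads, descs, keywords))

# ...and reconstruct it, as another script would after reading the bytes.
loaded_heads, loaded_descs, loaded_keywords = pickle.loads(payload)
```

To save to disk instead, the same tuple can be passed to `pickle.dump` along with a file opened in binary write mode.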
The heading and description
are the list of headings and their respective
articles in order. And the keywords
are akin to tags, but we won't be using
those in this example. We're going to first tokenize,
or split up the text, into individual
words because that's the level we're going to
deal with this data in. Our headline will be
generated one word at a time. We want some way of representing
these words numerically. Bengio coined the term
for this, word embeddings, back in 2003,
but they were first made popular by a team
of researchers at Google when they released word2vec,
inspired by Boyz II Men. Just kidding.
Word2vec is a two-layer neural
net trained on a big unlabeled text corpus.
It's a pre-trained
model you can download. It takes a word as its
input and produces a vector as its output, one
vector per word. Creating word vectors lets us
analyze words mathematically. So these high
dimensional vectors represent words
and each dimension encodes a different property,
like gender or title. The magnitude along each
axis represents the relevance of that property to a word.
So we could say king minus
man plus woman equals queen. We can also find the
similarity between words, which equates to distance.
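To make the vector arithmetic concrete, here is a small sketch using hand-made two-dimensional vectors. The values and the two "properties" (royalty, gender) are assumptions chosen so the analogy works out. Real embeddings are learned and have hundreds of dimensions.

```python
import math

# Hand-made toy vectors; the dimensions are (royalty, gender).
vectors = {
    "king":   [1.0, 1.0],
    "queen":  [1.0, -1.0],
    "man":    [0.0, 1.0],
    "woman":  [0.0, -1.0],
    "prince": [0.9, 0.8],
    "girl":   [-0.2, -1.0],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(start, minus, plus):
    """Return the word closest to start - minus + plus, excluding the inputs."""
    target = [
        s - m + p
        for s, m, p in zip(vectors[start], vectors[minus], vectors[plus])
    ]
    candidates = [w for w in vectors if w not in (start, minus, plus)]
    return max(candidates, key=lambda w: cosine_similarity(vectors[w], target))

result = analogy("king", "man", "woman")  # king - man + woman
```

With these toy values, `result` is `"queen"`, because subtracting "man" and adding "woman" flips only the gender dimension.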
Word2vec offers a
predictive approach to creating word vectors,
but another approach is count based.
And a popular algorithm
for that is GloVe, short for global vectors.
It first constructs a large
co-occurrence matrix of words by context.
For each word, i.e.
row, it will count how frequently it sees
it in some context, which is the column.
Since the number of
contexts can be large, it factorizes the matrix to
get a lower dimensional matrix, which represents
words by features. So each row has a feature
representation for each word. And they also trained it
on a large text corpus. Both perform similarly well,
but GloVe trains a little faster so we'll go with that.
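The counting step can be sketched on a tiny corpus. This sketch assumes a context is any word within one position of the target word; the corpus and the window size are made up. It shows only the co-occurrence counts and leaves out the factorization. GloVe itself learns its vectors by fitting them to the logarithm of these counts with a weighted least-squares objective.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=1):
    """Count how often each (word, context word) pair appears within `window` tokens."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

corpus = ["the cat sat", "the cat ran", "the dog sat"]
counts = cooccurrence_counts(corpus)

# Rows are words, columns are context words, in the same sorted order.
vocab = sorted({word for word, _ in counts})
matrix = [[counts[(row, col)] for col in vocab] for row in vocab]
```

Here "the" and "cat" co-occur twice, so the cell for that pair holds 2, while "dog" and "sat" co-occur once.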
We'll download the
pre-trained GloVe word vectors from this link and
save them to disk. Then we'll use them to
initialize an embedding matrix with our tokenized vocabulary
from our training data. We'll initialize it
with random numbers then copy all the GloVe weights
of words that show up in our training vocabulary.
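A rough sketch of that initialization follows. The `glove` dictionary here is a hypothetical stand-in for vectors parsed from the downloaded file. The three-dimensional vectors, the random range, and the function name are all assumptions for illustration.

```python
import random

def build_embedding_matrix(vocab, glove, dim, seed=0, scale=0.1):
    """Start every row with small random numbers, then copy in GloVe vectors."""
    rng = random.Random(seed)
    matrix = [[rng.uniform(-scale, scale) for _ in range(dim)] for _ in vocab]
    for i, word in enumerate(vocab):
        if word in glove:
            matrix[i] = list(glove[word])  # overwrite with pre-trained weights
    return matrix

# Hypothetical stand-in for the pre-trained GloVe vectors.
glove = {"rain": [0.2, -0.1, 0.7], "market": [-0.5, 0.3, 0.1]}
vocab = ["rain", "market", "zxqv"]  # "zxqv" has no GloVe vector

embedding_matrix = build_embedding_matrix(vocab, glove, dim=3)
```

Words found in GloVe get their pre-trained row, and anything else keeps its random row.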
And for every word outside
this embedding matrix, we'll find the closest
word inside the matrix by measuring the cosine
distance of GloVe vectors. Now we've got this
matrix of word embeddings that we could do so
many things with. So how are we going to use
these word embeddings to create a summary headline for a
novel article we feed it? Let's back up for a second.
[INAUDIBLE] first introduced
a neural architecture called sequence to sequence in 2014.
That later inspired
the Google Brain team to use it for text
summarization successfully. It's called sequence to sequence
because we are taking an input sequence and outputting
not a single value, but a sequence as well.
[SINGING] We gonna
encode, then we decode. We gonna encode, then we decode.
When I feed it a book,
it gets vectorized, and when I decode
that, I'm mesmerized. So we use two
recurrent networks, one for each sequence.
The first is the
encoder network. It takes an input
sequence and creates an encoded representation of it.
The second is the
decoder network. We feed it as its input that
same encoded representation and it will generate an output
sequence by decoding it. There are different ways we
can approach this architecture. One approach would be to let
our encoder network learn these embeddings from scratch
by feeding it our training data. But we're taking a less
computationally expensive approach, because we already
have learned embeddings from GloVe.
When we build our
encoder LSTM network, we'll set those
pre-trained embeddings as our first layer's weights.
The embedding layer is
meant to turn input integers into fixed size vectors anyway.
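Conceptually, that integer-to-vector step is just a row lookup. Below is a plain-Python sketch of it, with a made-up vocabulary and matrix. In Keras this lookup is handled by the Embedding layer, which is not used here so the sketch runs without any dependencies.

```python
def tokenize(text):
    """Lowercase and split on whitespace; a real tokenizer would also handle punctuation."""
    return text.lower().split()

def embed(text, word_to_id, embedding_matrix):
    """Map each token to its integer id, then to that id's row of the matrix."""
    ids = [word_to_id[token] for token in tokenize(text)]
    return ids, [embedding_matrix[i] for i in ids]

# Hypothetical vocabulary and embedding matrix (one row per word id).
word_to_id = {"rain": 0, "expected": 1, "tuesday": 2}
embedding_matrix = [
    [0.2, -0.1, 0.7],  # rain
    [0.0, 0.4, -0.3],  # expected
    [0.9, 0.1, 0.1],   # tuesday
]

ids, vectors = embed("Rain expected Tuesday", word_to_id, embedding_matrix)
```

Each word becomes an integer and each integer selects a fixed-size vector, so a three-word input yields three vectors of length three.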
We've just given it a huge
head start by doing this. And when we train this
model, it will just fine tune or improve the
accuracy of our embeddings as a supervised classification
problem where the input data is our set of vocab words
and the labels are their associated headline words.
We'll minimize the cross-entropy
loss using rmsprop. Now, for our decoder.
Our decoder will
generate headlines. It will have the same LSTM
architecture as our encoder and we'll initialize
its weights using our same pre-trained
GloVe embeddings. It will take as input
the vector representation generated after feeding in the
last word of the input text. So it will first generate
its own representation using its embedding layer.
And the next step is to
convert this representation into a word, but there is
actually one more step. We need a way to decide
what part of the input we need to remember,
like names and numbers. We talked about the
importance of memory. That's why we use LSTM cells.
But another important aspect of
learning theory is attention. Basically, what is the most
relevant data to memorize? Our decoder will generate
a word as its output and that same word
will be fed in as input when generating
the next word until we have a headline.
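That feedback loop can be sketched on its own. The `transitions` table below is a hypothetical stand-in for a trained decoder. A real decoder would also condition on its hidden state and on the encoded input, not only on the previous word.

```python
def generate_headline(next_word, start="<start>", end="<end>", max_len=10):
    """Feed each generated word back in as the next input until `end` appears."""
    headline = []
    current = start
    for _ in range(max_len):
        current = next_word(current)
        if current == end:
            break
        headline.append(current)
    return headline

# Hypothetical stand-in for a trained decoder: a fixed lookup table.
transitions = {
    "<start>": "rain",
    "rain": "expected",
    "expected": "tuesday",
    "tuesday": "<end>",
}

headline = generate_headline(transitions.get)
```

The `max_len` cap is a safety net so generation stops even if the model never emits the end marker.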
We use an attention mechanism
when outputting each word in the decoder.
For each output word,
it computes a weight over each of the
input words that determines how much
attention should be paid to that input word.
All the weights
sum up to 1 and are used to compute a
weighted average of the last hidden layers
generated after processing each of the inputted words.
We'll take that weighted average
and input it into the softmax layer along with the last hidden
layer from the current step of the decoder.
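The weighting step might look roughly like this. The video does not say how the attention scores are computed, so this sketch assumes a simple dot product between the decoder state and each encoder state. The state values are made up.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    """Return attention weights and the weighted average of the encoder states."""
    scores = [
        sum(d * e for d, e in zip(decoder_state, state))
        for state in encoder_states
    ]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * state[k] for w, state in zip(weights, encoder_states))
        for k in range(dim)
    ]
    return weights, context

# One hidden state per input word (hypothetical values).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]

weights, context = attend(decoder_state, encoder_states)
```

The first and third input words score highest against this decoder state, so they get the largest weights. The resulting `context` vector is what gets combined with the decoder's current hidden state before the softmax over output words.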
So let's see what our model
generates for this article after training.
All right, we've got this
headline generated beautifully. And let's do it once more
for a different article. Couldn't have said
it better myself. So, to break it down, we can
easily use pre-trained word vectors from a model like GloVe
to avoid having to create them ourselves.
To generate an output sequence
of words given an input sequence of words, we use
a neural encoder decoder architecture.
And by adding an attention
mechanism to our decoder, it can help it decide what
is the most relevant token to focus on when
generating new text. The winner of the coding
challenge from the last video is Jie Xun See.
He wrote an AI composer
in 100 lines of code. Last week's challenge
was non-trivial and he managed to get
a working demo up. So definitely
check out his repo. Wizard of the week.
The coding challenge
for this video is to use a sequence
to sequence model with Keras to summarize
a piece of text. Post your GitHub
link in the comments and I'll announce the
winner next video. Please subscribe for more
programming videos and for now, I've got to remember
to pay attention. So thanks for watching.