SPACY’S ENTITY RECOGNITION MODEL: incremental parsing with Bloom embeddings & residual CNNs

Last week, we released version 2 of our
natural language processing library spaCy, which brings the library up to date
with the latest deep learning methodologies for solving tasks such as
tagging, parsing and named entity recognition. Naturally, we've had a lot
of questions about how the statistical models we're using work and why we did
things the way we did. In order to answer that in an easy way, I've decided
to dust off some slides from a presentation I gave recently at Zalando Research,
to take you through the thought process behind this and also introduce you to
how we think about what we're doing here, and some of the ins and outs and whys.
So this is the overview of the activities that we're doing at Explosion
AI. spaCy, as I said, is this open source library for what we call
"industrial-strength natural language processing", and what we mean by that is
really just that it's application-oriented, as opposed to many of the other
libraries and technologies which were much more oriented towards doing
research.

So now that these technologies, machine learning and natural language
processing, are getting increasingly commercially viable, I think it's
important that people have production-ready implementations: implementations
that, instead of giving you a forest of different options to configure
and choose between, offer just one consistent way of doing
things. It's a snapshot of the best methodology, or the best curation
of the ways things can be done, and I think that's actually been very
valuable to people. We've been very gratified to see the response to this. So
the other thing that we're working on is this library Thinc,
which is what holds the deep learning implementations for spaCy.
This isn't quite ready for independent use, so we don't have independent
documentation for it, but actually, I think there's some
interesting stuff there if you want to check it out. And we're preparing
a more stable release of this. And then, finally, Prodigy, our first commercial
product, is an annotation tool that integrates very well with spaCy.

But you
can also use it with other deep learning tools and other libraries. And I
think this is kind of a missing piece in the ecosystem around machine learning at
the moment, because annotation is super important but there wasn't a really
clear workflow or procedure, a repeatable methodology for this, and I think Prodigy brings that to you and
makes the whole procedure much easier. And then, finally, we're also interested
in producing a data store of pre-trained, customizable models that
cover a wider variety of languages and use cases. So we'll have one
pre-trained model for, say, product recommendations on social media and
another pre-trained model for legal text in German or something like
that. I think that this will help people really get started in a wider
variety of tools and problems, with models which are appropriate for
the use case that they're dealing with.

The brief bio of me and Ines,
the other half of Explosion AI. I've been working on natural language
processing since basically my whole career. I finished my PhD in
2009 and then continued publishing on natural language processing stuff. I left
academia in 2014. These things were getting increasingly
viable and instead of writing grant proposals I thought that it would
be more satisfying to take these things and actually get people using the
technologies that I've been working on for so long.

Ines also has a background in linguistics. She's the other lead developer of spaCy
and the lead developer of Prodigy. In addition to working on the libraries, she does the
front-end development for the visualizers and basically continues
doing the development on these things as well.

Okay, so here's a quite ham-fisted analogy that we use to describe what we're doing.
Because work in data science and machine learning
can get very abstract, we like to hold ourselves to the discipline of
explaining it with a more down-to-earth analogy.

And I think this
is actually quite useful for anybody, even people who are in the field.
So the chosen analogy that we use is kind of like a boutique kitchen.
The free recipes that we publish online are like the open source
software. Then we've been doing consulting work to bootstrap
the company, so this is like doing catering for select events.

And then
we've also got this line of what you could say is kitchen gadgets, which is
like our downloadable tool Prodigy, which you can use to enact the recipes
that we have online and basically make them yourself. And we're
also preparing this line of what you could say is premium ingredients, which
you can use to make the recipes. So these are the pre-trained models that
we're talking about. And I think this analogy is nice, because it gets
at something which, I think, is counterintuitive about open source
software and commercializing it or making a business around it. And that's that
it's very natural to see the software or the algorithms that you're producing as
the main value that you're providing, and so it's very natural to
say, well, that should be the thing that people directly pay you for.

Similarly, a chef
might say their recipes are the main differentiator and value that they're
producing. But there's a difficulty and that's that there are many recipes online
and people don't tend to know if a recipe is good until they've
already cooked it and tried it. And the overall cost of trying a recipe is quite high and so the question is, well, why should I
pay for your recipe that I don't even know is any good, instead of
taking any one of these abundant alternatives? From your internal view of things,
it makes a lot of sense to say, well, I worked so hard on this
code, it's what people should pay me for. But people won't know that your
library solves their problem until it's in production solving their problem,
and the cost of getting to that point is quite high. So it makes sense to give the software away.
If we give it to people and build trust, and we're producing good software that people like,
there are all sorts of other opportunities where we can say, well, we're producing other things that you may also find valuable and that may also solve your problem.

And this has been
working very well for us, and I think that this is a nice way of going about
things. The other ways of making money around open-source software, like
running consulting or a support service around it, really put you
in the business of running a helpdesk. They also create a bad conflict
between the quality of the software you produce, the quality of the documentation
you produce, and the side business of running a consultancy, because the
helpdesk business does better the harder your software is to use.
And so the better you do, the less you get paid.
I think that's not really a good system for you or your users. So this other
model, which accepts that the open source software is one thing,
and it's free, and that there are other things you can do around it that are
decoupled from it and are not free, is a better way to go about this,
and it has been working quite well for us.

I think it gives nice dynamics and
nice aligned incentives and better feelings all around.

Okay, so more specifically about spaCy, which is what we're talking about today.
This is our free open-source library for natural language processing. It's being used by
lots of companies around the world and I think it's really become kind of the
go-to solution that people have. One of the main reasons is that it's in Python, and
Python has really become a key language for data science and
machine learning work. So having a very high-performance, well-curated and well-maintained library for natural language processing has naturally struck a chord with people, and we're grateful to have so many people
using the software, reporting things, and essentially
helping out in the community. As I said, the main
thing that I want to get across is spaCy's solution to the
problem of named entity recognition.

So here's a far too short introduction to
the problem, and the results in the field, and also a little bit about
what makes this difficult and why the results on this task have actually been much lower than I would expect, which I think is kind of
interesting. So here's a quick look at what the task is and what the
expected output is here. Essentially, named entity recognition
is the task of tagging proper nouns and numeric entities and things
like that. And this is really what
makes it a foundational task in natural language processing,
because so much of natural language processing deals with annotations that are sort of
language-internal.

So words are defined in terms of other words, which are themselves defined in terms of other words,
etc. I think this has been described as “trying to learn Chinese by reading a
Chinese-to-Chinese dictionary”. It's hard to find a way into that network, but named entity recognition is great because the meaning of something like “Apple” is finally grounded outside of the linguistic network:
the meaning of “Apple” is the company Apple. So you can resolve that to a node
in a knowledge base or something like that. Similarly, for a currency entity like
“$1 billion”, you can attach a numeric value to
that and say that the meaning of “$1 billion” is in some sense 1,000,000,000
with a dollar sign. It's a numeric entity, and this gives you the grounding for all of the rest of the semantics in the task. And that's why
named entity recognition is this sort of beautifully useful task. Even if you're not doing so well at the rest of
language processing, if you're extracting the relations between
two named entities, you can average that over lots of
text and have a useful technology, even if the accuracy of the individual
relations is not so high.

This is really why it's this sort of foundational,
central task in natural language processing. The slide also shows a
little bit about how you do this in spaCy. We've tried to keep the API and usage
as simple as possible: you just iterate over
the entities in the document and get a Span object, which gives you the start and end positions and also lets you iterate over the tokens, etc.

So, what sort of accuracy can we get at extracting these named entities? Well, accuracy has been improving pretty quickly after a long period of plateau on this
task. Around 2009, the best accuracy on this data set, the
OntoNotes 5 data set, was around 83.5%. And then, finally, in 2014, the accuracy
of named entity recognition systems built on neural networks
started to really overtake this long-stable peak of performance.
Once it started to overtake that, it's been improving pretty quickly. Over the last year in particular, convolutional networks
have done quite well at this, usually with some sort of CRF component on top.
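To make the API mentioned above concrete, here's a minimal example of iterating over the entities; the model name and the example text are just illustrative:

    import spacy

    # Load a pre-trained English pipeline (any of the en_core_web_* models works here)
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

    # doc.ents is a sequence of Span objects, one per predicted entity
    for ent in doc.ents:
        print(ent.text, ent.label_, ent.start_char, ent.end_char)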

The solution in spaCy's model is pretty similar to Strubell et al.'s system,
and the system that we package as en_core_web_lg, which was built on the
GloVe vectors and so is a little bit larger to download, achieves fairly comparable
performance to that system. The en_core_web_sm model, which doesn't
use pre-trained vectors, is a little bit lower in accuracy but still significantly better than what
we were getting in spaCy v1.

To give you some perspective on the
comparison between spaCy v1's linear model and the new neural network model:
the linear model in spaCy v1 was performing at about… do we have
this here? Ah yes, NER F. The previous
en_core_web_lg model, which essentially used more features and less aggressive L1
regularization, achieved 81.4% on this task, and even the
new small model is substantially better than that – about
a 25% error reduction, which is very nice. You can also see the
accuracy on the other tasks. UAS refers to the unlabeled attachment score
for the dependency parser. On that task, we're getting around 92%
for the en_core_web_lg model, which is also better than what we had with the linear model. You can also see the part-of-speech tagging accuracy,
which is up around 97.2%. So immediately you can see that the NER task is surprisingly difficult.

Naively, I would have expected that the task of attaching the syntactic parses, which requires specialist linguistic knowledge
to create the annotations and is kind of a more nuanced and involved task,
should actually be harder than the named entity
recognition task. But in fact, the per-link accuracy on the dependency
parse tends to be higher than the accuracy on named entity recognition. So it's
worth stopping and wondering why that might be.

Why is it that
this seemingly simple task of just detecting organizations or tagging
currency elements and stuff is so persistently difficult? What's so hard about named entity recognition? Well, I actually think
that there's a point of sociological interest here that has
meant that progress on this has been a little bit slower than on some of the other
natural language processing tasks. And that's that, despite
being sort of foundationally important to the field, named entity recognition
research is not actually that fun a thing to do.

So there's a couple of
reasons why it's relatively non-rewarding, and that aspect of it being
unrewarding or, as I say, a bad thesis topic has made progress a little bit
slower than you would expect relative to the importance of it in the field. So if
you look at the characteristics of the problem: it's a structured
prediction task, which is great. That's interesting, and it gives
you nice, meaty algorithms to inspect. It's also knowledge-intensive,
which is kind of cool, but also kind of inconvenient, because you have to process
large data sets and, horror of horrors, maybe interact with a database,
which is basically poison to all researchers. And finally, the thing
that makes it, I think, especially troublesome is that it's got this
really annoying mix of easy and hard cases. So your algorithm can be really
good, but then it's still gonna get the same 75% that any stupid algorithm
will get, and this gives you this compressed dynamic
range to show that your clever new idea is actually making a big difference.
And then, on top of this mass of easy examples, there's also some examples
which are maybe misannotated, or basically impossible, so if 70% of the cases are ones everybody's getting right, and then
another 10% of the cases are ones which nobody's getting right, then you
work really hard and do things right and then you get some score which is 1%
better than the other person's.

But on the other hand, maybe you
stuff up something easy, and somebody else who's just done more tuning than you and given it more tender loving care gets the same 1%. This really makes it
hard to show that your ideas are working, and it also means that it's hard for
the community to see: when are we on the right track? What should we do better going forward? I think that this makes it a difficult
research topic for us. Our normal hill-climbing method is
relatively less effective than it is on something which gives us a
smooth gradient to ascend as researchers. So I think that's one reason why
progress on named entity recognition has been a little bit slower
than on parsing.

So to give you some insight… at this point I'll start explaining how spaCy's named entity recognizer works.
I actually import a technique from the parsing community, which is relatively
less studied in named entity recognition but has actually had a couple of people
doing it the same way, and the figure that I've clipped here is from one of
those papers, from Lample et al.

The overall framework of structured
prediction that I'm using is transition-based. What this means
is that instead of taking the perspective that each word is the object of interest
and we predict something on it or attach a tag to
it, which is the normal tagging framework that people use for
named entity recognition, we instead imagine ourselves
as a little state machine. We start off in a condition where we've got no
output attached: we've got nothing on our stack and we've
got all of the words in the sentence ahead of us in the buffer. We're this little
state machine and we're going to proceed by looking at
the next word. The next word is the first one ahead of us in the buffer, and
we've got some universe of possible actions that we can take that will move
us from the current configuration into another configuration.
That universe of actions can differ depending on the transition
framework that you're using, which means that this transition-based
approach is very flexible.

You can come up with pretty
satisfying solutions to any sort of structured prediction task in natural
language processing with this kind of approach: reading the sentence,
maintaining some state, and manipulating that structure with some
universe of actions. So I quite like this. I think that it's a good way to frame
these problems, and it gives you a good way to flexibly calculate features and
basically tune the structured prediction task to the problem
that you're dealing with. In the case of named entity recognition, we'll
have some action that starts an entity, so basically, we'll have an action
that corresponds to the begin move. I've actually found that it's best
to have the actions fix the label at the start of the entity, which is slightly
different from some papers that have used this framework and fixed the
label at the end. I've found that fixing the label at
the beginning and using that to invalidate future actions works
better.

So we take an action “begin entity” and that puts this word “Mark”
on the stack. And then from there we can say, what's our next action? Well,
we can shift again and put “Mark Watney” onto the stack, and then from there, in
the transition system which Lample et al. use, they make a reduce action
which assigns the label “person” to that. As I said, the
transition system that spaCy uses is actually slightly
different: we would have a transition that's “B-PER”, so we would move a
word onto the entity stack and also label it “PER”. And then we would have
another action that says “L-PER”, which marks the next word “Watney”,
takes that as part of the entity and reduces it all in one go. This transition system,
which corresponds to the BILUO tag scheme, gives you better discrimination between the different classes and makes the learning
problem slightly easier.
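To illustrate the scheme, here's a toy sketch of BILUO decoding (this is not spaCy's internal code, just an illustration): a tag sequence for “Mark Watney visited Mars” maps back to entity spans like this:

    def biluo_to_spans(tags):
        # B(egin), I(nside), L(ast), U(nit), O(utside); the label is fixed on B-/U-
        spans, start = [], None
        for i, tag in enumerate(tags):
            if tag.startswith("U-"):                            # single-token entity
                spans.append((i, i + 1, tag[2:]))
            elif tag.startswith("B-"):                          # entity opens, label fixed here
                start = i
            elif tag.startswith("L-") and start is not None:    # entity closes
                spans.append((start, i + 1, tag[2:]))
                start = None
        return spans

    # "Mark Watney visited Mars"
    print(biluo_to_spans(["B-PER", "L-PER", "O", "U-LOC"]))   # [(0, 2, 'PER'), (3, 4, 'LOC')]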

The way we're doing the transition
framework is really sort of computationally equivalent to the tagging task, but
there are some nice advantages to it. In particular, there's a very neat
built-in way to say that some actions are invalid and some actions are
valid. So when we are calculating what action to take next –
if we've started an entity and the action that gets the best score is “O”,
i.e. the next word is outside of an entity, we know that that's not
valid. And so even with the greedy heuristic we use in spaCy, where we
don't maintain multiple competing analyses, we know the transition sequence
that we predict will always be a valid sequence, because we have a neat way
of just blocking out those actions and building that into the framework. We also
have a neat way of writing arbitrary feature functions, which I'll describe
as we go on, and which I think is actually very important for
this, and it kind of comes for free. The big constraint that people adopt for the conditional random field approach, where they can only look
two words back, is not that appropriate for this, and they're actually
missing out on a lot of flexibility by framing the problem as a tagging
task.
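Here's a minimal sketch of what that validity masking could look like; the action inventory and state representation are simplified assumptions, not spaCy's actual transition code:

    import numpy as np

    # Hypothetical single-label action inventory
    ACTIONS = ["B", "I", "L", "U", "O"]

    def valid_actions(inside_entity: bool) -> np.ndarray:
        """Return a boolean mask over ACTIONS given whether an entity is open."""
        if inside_entity:
            # Once an entity has been opened, we must continue (I) or close (L) it.
            return np.array([False, True, True, False, False])
        # Otherwise we may start an entity (B), tag a single-token entity (U), or skip (O).
        return np.array([True, False, False, True, True])

    def best_valid_action(scores: np.ndarray, inside_entity: bool) -> str:
        masked = np.where(valid_actions(inside_entity), scores, -np.inf)
        return ACTIONS[int(np.argmax(masked))]

    # Even if "O" scores highest, it can't be chosen while an entity is open:
    print(best_valid_action(np.array([0.1, 0.2, 0.3, 0.1, 0.9]), inside_entity=True))  # "L"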

Okay, I'll now shift gears quite
abruptly and talk about the statistical model that's used to predict
these transitions, so in other words, how the neural network is structured.
And the way to understand that is really what I call the “embed, encode, attend, predict” framework. I've got a blog post explaining this, but it's
like a neat shorthand that I have for unpacking this forest of
different neural network techniques. And I think it's a nice framework for
understanding and intuiting the design of these models and how the
different components and pieces are plugged together.

So the "embed, encode,
attend, predict" framework really just means that we start off trying to
represent words, and then, after we've got our representation of the words
from the dictionary, we naturally want to look at the word in context and
recalculate the word's representation based on that.
That's the "encode" step. Then, after we've got this representation of each
word in context and we've got this two-dimensional shape for a piece of text, like a sentence or a document, we then want to
come up with some summary vector of it and condense that down into one piece
of information that then we can predict something from.

That's why I say that this playbook for natural language processing and deep
learning has these four components "embed, encode, attend, predict".
There are different modules that are being developed for each one of
these. But if you understand it this way, then we can see, ah, okay, I'll plug that in
here or I'll try this, or I'll swap that out, and it becomes a lot less confusing.
So I'll go through these in detail. As I said, the way to
think about this is to zoom out a little bit and to think of data shapes and
transformations between those data shapes rather than the details of an
application.
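As a rough illustration, here's a hypothetical numpy sketch (not spaCy's code; all sizes and weights are made up) of the four steps viewed as functions that transform those data shapes:

    import numpy as np

    n_words, width, n_classes = 5, 128, 4
    word_ids = np.random.randint(0, 10_000, size=n_words)         # integers: word/category IDs

    def embed(ids):             # (n_words,) ints       -> (n_words, width) matrix
        table = np.random.randn(10_000, width)
        return table[ids]

    def encode(matrix):         # (n_words, width)      -> (n_words, width), now context-sensitive
        return matrix + np.roll(matrix, 1, axis=0)                 # stand-in for a CNN or BiLSTM

    def attend(matrix, query):  # matrix + query vector -> single summary vector (width,)
        weights = np.exp(matrix @ query)
        return (weights / weights.sum()) @ matrix

    def predict(vector):        # (width,) vector       -> class ID (integer)
        return int(np.argmax(vector @ np.random.randn(width, n_classes)))

    summary = attend(encode(embed(word_ids)), query=np.random.randn(width))
    print(predict(summary))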

So, the basic data shapes that we're manipulating are a really
small inventory of items, and I find this very satisfying. We have integers for category labels, which can be words but also the output identifiers that
we're predicting. They can also be part of speech tags, labels or anything that's
basically a discrete ID. Then we have a vector for a single
meaning; usually these are relatively short – say 64, 120 or 300
units long. A vector lets you represent the IDs in a way where you can conveniently do similarity operations
and essentially feed them forward in a neural network. Then, another
data shape type that we have is a sequence of vectors. What I mean
by this is that usually in language, the linear order of the IDs that we're dealing with
is very important, obviously. If you're dealing with words or characters or something, you can't just shuffle
them up and get the same meaning representation. But we do
want to look those things up in a static table. So there's this
intermediate state that we can be in, where we have a vector for each
of the words, but the vectors were assigned independently
of each other, even though we know that there are conditional
dependencies between them.

This is where this matrix view comes in,
where we basically get to recalculate the sequence of vectors into another form
which takes into account the fact that
there's this ordering constraint or ordering effects, and that's been
a big help to natural language processing. I think having that sort of unit has
really been one of the main contributions or one of the main things
that's made neural network methods work better than the linear models.

The embed step, as I say, is this step of learning dense embeddings: we take an ID for a word or something and we get a vector for it.

Usually, the calculation of this is built on an insight which we
call the distributional hypothesis, often attributed to Firth:
“You shall know a word by the company it keeps.” This means that instead of
worrying about the Aristotelian definition process of trying to figure out what the
essential characteristics of some word or category are, you can just look at the
words surrounding it. You don't have to worry about whether a dog
is necessarily a mammal or a quadruped; it's something
that occurs near the words “furry” and “barks” and “walked” and “friend” and “pet”.
Based on this sort of view of the meaning of “dog”, you can
just look at a bunch of text and end up deciding that, okay, a Labrador
Retriever and a Rottweiler are really similar in distribution. And then you can come up
with this view of, okay, I still don't know exactly what those things
are, but I know that they're related and I know that I can process them in
similar ways.

And that's a very useful approximation to have and a very useful
type of technology to have, because it means that we can take into account
all of the text and all of the knowledge that's encoded in unlabeled text and
we're not limited to just the text that we have annotated with specific types of
knowledge that we thought we needed ahead of time.

So, how does spaCy do this process of taking IDs and calculating embeddings? Well,
as I said, there's all these ways of calculating pre-trained embeddings,
which take into account just raw text, but additionally we want to have ways of
learning word representations that are specific
to the types of problems that we're dealing with. And this is a
slightly different process, so the solution that we have for embedding in
spaCy is a little bit more intricate than many of the technologies or
libraries that are commonly used for this, especially ones which
haven't been designed so much for natural language processing, and where
the single embedding table is a bit of an afterthought.

So what we actually do is this: the first step is a doc2array procedure,
where we extract four attributes of each token in the document. The four
columns of this are: the norm, an ID for a normalized form of the string, which is basically the lowercased form, but you can
adapt or plug in different feature functions for that and calculate any type
of string transformation that you want; the prefix, which, I think, by default
is length 3 in spaCy; the suffix, which is length 3;
and a word shape feature, which basically replaces all the digits 0-9 with the letter “d”, the lowercase characters with a lowercase “w”, and the uppercase characters with an uppercase “W”.
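A rough sketch of such a shape function, following the character classes described above (spaCy's built-in shape feature differs in detail, e.g. it truncates long runs of the same character class):

    def word_shape(word: str) -> str:
        # Map each character class to a single symbol:
        # digits -> "d", lowercase -> "w", uppercase -> "W"; keep everything else.
        out = []
        for ch in word:
            if ch.isdigit():
                out.append("d")
            elif ch.islower():
                out.append("w")
            elif ch.isupper():
                out.append("W")
            else:
                out.append(ch)
        return "".join(out)

    print(word_shape("Watney"))   # "Wwwwww"
    print(word_shape("$1.5bn"))   # "$d.dww"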

This gives you a fuzzy,
zoomed-out shape of the word, and this is especially useful for unknown words: if the model hasn't seen the
norm, it still gets to see, ah okay, this word has this type of shape, and so it
gets to come up with a representation of it. Now, having extracted this four-column view of the document, after this
feature extraction stage we have a matrix of numeric identifiers with four columns
and one row per word in the document. The next step is,
for each of those columns, to embed them into a table, and the
embedding table uses what's called the “hashing trick”. I think some recent
publications describe this as “Bloom embeddings”, which I think is a
pretty reasonable name for it. This is something which I sort of came to
independently, and I see a lot of other researchers coming to the same
idea, because it's well set up by previous work in
natural language processing. So the idea here is just that instead of
having a fixed inventory in the embedding table and saying that all
words outside of that inventory share a single out-of-vocabulary vector,
we're going to mod the IDs into the table. So we've got some long hash value
for, say, the norm and we're going to mod that into, say, 7500 rows.

This means that a lot of the entries will end up colliding
and landing in the same bucket, so lots of our words will end up with the
same vector representation from that key. The solution to this is just to do
it again: we calculate another key for that word with a different random seed,
and then another and another. In fact, I use four keys. So the vector for, say, the
norm is the sum of four different buckets in that table, and this means
that the vast majority of the words in our vocabulary are going to end
up with unique representations out of this process, because it's super unlikely
that any of our words will collide on all four of these keys.
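Here's a minimal numpy sketch of the idea; the table size, vector width and hashing scheme are just illustrative stand-ins for what spaCy and Thinc actually do:

    import numpy as np

    n_rows, width, n_keys = 7500, 128, 4           # illustrative sizes
    table = np.random.randn(n_rows, width) * 0.1   # one shared table of vectors

    def bloom_embed(word: str) -> np.ndarray:
        # Hash the word with four different seeds, mod each hash into the table,
        # and sum the four rows. Two words may collide on one key, but almost
        # never on all four, so distinct words get distinct vectors.
        rows = [hash((seed, word)) % n_rows for seed in range(n_keys)]
        return table[rows].sum(axis=0)

    print(bloom_embed("Watney").shape)   # (128,)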

And this means
that the learning process is harder because it's less sensible, it has to
put together these difficult arbitrary sums but it does mean that no matter what, each word that's distinct is going to
end up with a distinct representation from this process and so we never have
this sort of fixed vocabulary effect and then if we go on to continue to train
a table, the model is always able to learn new vocabulary items. We don't
have the resize the vectors because some word was initially outside of our
representation. So I've been actually relatively generous in the number of rows
these tables because computationally it's a pretty cheap operation regardless
and it doesn't take much memory but you can actually run this with very very few rows.

I've tried a Spanish part-of-speech tagger with something like 200 rows in the
hash embedding table and it still learns very well, because for Spanish
part-of-speech tagging in the simple Universal Dependencies scheme,
the prefix and suffix features do very well. So it turns out that
we don't need many rows, and this embedding strategy is able to take that
into account.

For the English processing tasks for spaCy,
1000 or 2000 is totally fine and so the embedding table can be super small and super dense
and this is very useful because it means that we can learn word representations
without having this terrible computational cost of doing an update
over the whole table, and we also don't have to worry about sparse update
strategies and that sort of thing. So this is really good. After
we've come up with an embedding table, a separate embedding for
each of these features, we then concatenate them together, and this is
the notation that I'm using here.

This pipe is function concatenation,
which means that we take four functions, each of which outputs a vector, and combine them into
another function which outputs a vector that's the concatenation of each of
their pieces. Thinc lets you overload operators to basically be combinators
over these models, which I think is a very concise way of defining these
things and doesn't have any computational overhead. And you can do it
in a block scope, so you don't end up with this crazy overloading when
you're doing other things later on.

So it's not so inconvenient. Basically,
over the scope of a “with” block, you bind operators to arbitrary combinators. The notation that I usually use is the pipe (|) for
concatenation, because I find it a nice, convenient shorthand. The double angle bracket (>>), this
sort of “shift forward” operator, I use for
chaining or piping, so it means “feed forward”: take
the concatenated output of these features and then mix them with a
multi-layer perceptron. I actually use one hidden layer and the Maxout unit, which I
find is actually pretty good for the tasks I've worked on, but we could easily
use a rectified linear unit or something. I've just found that
Maxout has worked better for what I've been doing.
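To give a feel for the combinator idea, here's a toy illustration in plain Python; this is not Thinc's actual API, just a sketch of what overloading | and >> buys you:

    # Toy combinators: | concatenates the outputs of two "layers", >> chains them.
    class Layer:
        def __init__(self, fn):
            self.fn = fn

        def __call__(self, x):
            return self.fn(x)

        def __rshift__(self, other):      # a >> b : feed forward (chain)
            return Layer(lambda x: other(self(x)))

        def __or__(self, other):          # a | b : concatenate the outputs
            return Layer(lambda x: self(x) + other(x))   # lists concatenate

    norm   = Layer(lambda tok: [tok.lower()])
    prefix = Layer(lambda tok: [tok[:3]])
    suffix = Layer(lambda tok: [tok[-3:]])
    shape  = Layer(lambda tok: ["".join("W" if c.isupper() else "w" for c in tok)])
    mix    = Layer(lambda feats: feats)   # stand-in for the multi-layer perceptron

    embed = (norm | prefix | suffix | shape) >> mix
    print(embed("Watney"))   # ['watney', 'Wat', 'ney', 'Wwwwww']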

One additional detail: I started using layer normalization over this, and I found that
it works very well. I used to use batch normalization, but
that causes all sorts of problems, which is a rant for another day. Okay, so we take these four features, concatenate them, feed them forward into a
multi-layer perceptron, and end up with a 128-dimensional vector per word that takes
into account sub-word features and is able to learn an arbitrarily-sized
vocabulary. I find that very nice.

The next step is, after we've figured out how
to embed each word individually, how to come up with a vector for just
one word out of context, we naturally want
some way of learning representations for larger phrases, taking into account the vectors of neighboring words. The
by far most popular strategy for this
is long short-term memory recurrent neural network units (LSTMs), which are
quite a nice way of reading the text forward and coming up with a vector
for each item, then reading the text backward, and concatenating it all.

So
people have described a “BiLSTM hegemony” in natural
language processing at the moment. But I actually use convolutional neural
networks for this operation in spaCy, and indeed in most of the other natural
language processing work that I've been doing, I've found convolutional
neural networks to be a pretty satisfying solution to this, and I'll explain why. So,
first of all, the style of convolutional neural network that's useful for this is
kind of different from the way that people do this in vision. So in vision,
people are mostly interested in using convolutional neural networks for kind
of dimensionality reduction, or for reducing a matrix, what I call a
matrix format, into a vector format. And so you use lots of filters and
there's this question about the stride and stuff. I found this very confusing,
and I still basically struggle to define the operations I want
in terms of the API that TensorFlow and all of the other
libraries give me. Instead, I really think about this as just extracting a window
of words on either side of the word. The usage here is actually really similar to Collobert and Weston's (2011) “Natural language processing (almost) from scratch” approach.

It's just that we take into account some
of the more recent upgrades and innovations in neural networks to
make things slightly easier to train and slightly easier to
optimize. The fundamental building block here is this trigram CNN
layer, which takes a window of one word on either side of the word and concatenates them
together, so that if we start off with 128 dimensions per word, you're
going to have 384 dimensions for each word, because you've concatenated the word with its neighbor on either
side.

And then from there, we just use a multi-layer perceptron to take that
input representation and map it down to 128 dimensions. So what we're
doing there is mixing the information from the two words on either
side and our target word to produce an output vector of the same
dimensionality. We relearn what this word means based on its neighbors. Then,
as we stack these, as we do this process repeatedly, we
end up with this interesting effect where the effective receptive field, as they say in vision, grows the deeper you go. After
the first repetition of this, you end up with a vector that's sensitive to one
word on either side. In the next iteration, you're going to be sensitive to information in
the immediate neighbors, but those vectors are themselves sensitive to information
one word to either side of them.
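Here's a rough numpy sketch of one such trigram CNN layer with a residual connection; the random weights and the plain ReLU are placeholders (the real layers also use Maxout and layer normalization, as mentioned above):

    import numpy as np

    width = 128
    W = np.random.randn(3 * width, width) * 0.01   # placeholder MLP weights
    b = np.zeros(width)

    def trigram_cnn_layer(X: np.ndarray) -> np.ndarray:
        """X has shape (n_words, width); returns the same shape."""
        # Pad with a zero vector at each end so every word has two neighbours.
        padded = np.vstack([np.zeros(width), X, np.zeros(width)])
        # Concatenate (previous word, word, next word) -> (n_words, 3 * width).
        windows = np.hstack([padded[:-2], padded[1:-1], padded[2:]])
        # Mix back down to the original width and add the residual connection.
        return X + np.maximum(0, windows @ W + b)

    X = np.random.randn(6, width)                    # 6 words, one 128-dim vector each
    out = trigram_cnn_layer(trigram_cnn_layer(X))    # stacking widens the receptive field
    print(out.shape)                                 # (6, 128)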

So in the second layer here you're actually
drawing information from two words to either side. It's just that that information is filtered through the
neighboring word, and it's kind of, I guess, weaker. So there is kind of a
decaying effect of that information, but it is there if it's important. And if we continue stacking this process, then at
layer 4 we're drawing information from potentially four words on either side, which
I think is a very satisfyingly large window. I actually don't see much
motivation for having an unbounded length at which we can draw
information into the word's vector. We're not trying to decide the whole task at
this point; remember, the purpose of this unit is to recalculate the word's vector
based on the surrounding context. Taking into account the whole document
at this point, I think, really just makes it hard to reason about the flow
of information and what's mattering and what's not. It also gives you ample
opportunity to overfit the information in the training data and end up
being sensitive to all sorts of weird things that you didn't think you
wanted to be sensitive to.

So I quite like the idea that if somebody
inputs very short text or cut-up text or something, you know that it doesn't
really matter that they've truncated the text at an arbitrary point: as long as
they're four words in, they're at the same point, at the same sort of status, as they
would have been if they'd had a whole document before that. I think that's
a pretty useful way to reason about this, and it makes the model much more
generally applicable and easier to apply, because you know what matters and what
doesn't. That's not true in a BiLSTM, where you're potentially
– but likely not effectively – conditioning on arbitrary inputs.

So I
find the convolutional neural network kind of satisfying in this respect.
It's also computationally cheaper because you get to process this in
parallel for each of the layers, for each of the words. Finally, the other thing
that's worth noting here is that I use residual connections, so that the
output of each of these convolutional layers is the sum of the layer's output and its input.

Now, the residual connection here has an
interesting effect, because it effectively means that the output space
of each of these convolutions is likely to be sort of similar to the
space of the input. So we're not getting a really fundamental
transformation of the vector space, where things get all swapped around
and this word meaning is over here and that word meaning is over there
compared to the input, because you're taking the features you had coming in and
feeding them forward. And I think that really helps the network learn not to
mess things up too much from the input context; it sort of learns a bias
towards keeping things roughly similar, because it's always going to get the input
representation it had at the start.

I think that's also a pretty appealing
property for this type of unit, where we're trying to recalculate the vectors
based on the context.

Okay, so the next type of unit in
our little model family, or type of transformation or widget that we
have to manipulate, is an attention layer. And this type of unit is a
little bit vaguer than the other ones. Usually, when people are talking about
BiLSTMs, well, there's sort of a family of models that are described as
BiLSTMs, but they're all really similar; the variations between them are small.
But attention is kind of vaguer… There's quite a variety of
mechanisms or neural network models that people have described as “attention”, so
I actually like to think of this in terms of its purpose: taking the
output of something like a BiLSTM, where you have one vector per word,
and calculating a single vector based on the whole surroundings or the whole unit.

In particular, it's about being able to take into account sort of a query vector
or a context vector with which to do that summary. So in this unit
here, we take an input query vector and one vector per word in
the sentence, and we learn a weighted summary of that. This has
also been a super important type of innovation that has come into
these natural language processing models. Now, I'm slightly abusing terminology here: I don't literally use an attention mechanism, but I'd like to think that the feature extraction that I do over the state vectors can be
understood the same way.

So rather than having this type of weighted
sum, I manually extract features, and the manual feature extraction also
has a translation layer into the hidden layer, so there is
kind of a weighting here. But I won't get hung up on whether or not
this is really “attention”. The features that I'm using are actually less
satisfying than I'd like at the moment, but this is what I've
ended up with. We take the first word of the buffer, the word immediately
before it, the word immediately after it, and then, I think, the last
word of the previous entity that we decided – ah, there's surely a mistake on the slide here, because we can't be looking one entity forward, since we read the
document in linear order – but it's essentially the last couple of entities.
I think I take the vector assigned to the first word of the
previous entity, the last word of the previous entity, and then maybe just the
last word of the entity before that.
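As a toy illustration (the exact feature set here is an assumption, not spaCy's real feature extraction), the state features might be gathered like this:

    def state_features(tokens, i, entities):
        """Pick out the token indices the model will look at for the current state.

        tokens   : list of words in the document
        i        : index of the first word of the buffer (the word being decided)
        entities : list of (start, end) spans already predicted, in order
        """
        feats = {
            "buffer_first": i,
            "word_before": i - 1 if i > 0 else None,
            "word_after": i + 1 if i + 1 < len(tokens) else None,
        }
        if entities:                       # previous entity, however far back it is
            start, end = entities[-1]
            feats["prev_ent_first"] = start
            feats["prev_ent_last"] = end - 1
        if len(entities) > 1:              # the entity before that
            feats["prev_prev_ent_last"] = entities[-2][1] - 1
        return feats

    tokens = ["Mark", "Watney", "visited", "Mars", "in", "2035"]
    print(state_features(tokens, i=3, entities=[(0, 2)]))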

The thing to note here is that these
features that consider the previous entities, these can be arbitrarily far back
in the document, so it doesn't matter whether the previous entity that we
assigned was 100 words back – we can still be conditioning on that. And this is
quite different from a CRF model which is bounded in the number
of previous decisions that you can condition on, which I think is kind of a
bad limit. What we give up for this is the sort of fancy Viterbi decoding.
Instead, we just have greedy processing. We can use beam search if we
want, and I think that works pretty well, but greedy search is working
well in spaCy at the moment. And because of this, we can write arbitrary
feature functions, which I think is a very nice win. It has
been demonstrated by the dependency parsing community that
it's really quite worth being able to write arbitrary feature functions, and
it's worth giving up the dynamic programming in order to achieve that.

And then, finally, the simplest
type of unit, which people are most familiar with, is the prediction layer,
which is just a basic multi-layer perceptron type of process.

Putting it all together, in slightly pseudocode form,
this is what the overall parsing loop looks like. We start off
by getting a tensor. Inside this function we embed the words
of the document and feed that into the trigram CNN, so that at the end of this
we have a tensor with one row per word in the document, and the vectors in those rows
take the context of each word into account, so they're sensitive to phrases and that sort
of thing.

Then we initialize our state and start stepping through this
procedure, where, as long as we're not in a finished state – which in the context
of the entity recognizer means that we're not at the end of the buffer –
we calculate the feature IDs for the state. Oh, actually, I missed a step here:
in addition, after we get the tensor, we pre-compute the “attention weights”, i.e. we pre-compute the first hidden layer.
We take these tensors and multiply them by the first hidden layer's weights, so
we've got this nice, ready format that we can just pick out of and sum
into. That means that all of this heavy computation happens
up front, and while we're stepping through the document working on one word at a time, we have to do much less computation.

This also
means that in GPU mode, this phase of the work happens on
the GPU, and then we copy the tensors over to the CPU, because
the algorithm that actually steps through the words isn't currently implemented
on GPU; I've got a CPU implementation of it. So this parsing
procedure can happen in a way that shares work better between the
CPU and GPU. It also means that even if you're working in purely CPU mode,
you can get pretty decent parsing speeds from this – certainly not quite as
good as the linear model, but better than other systems that are
available. After we've calculated the features for the state, we use a
pretty basic multi-layer perceptron to get action probabilities, and then we
have a procedure which checks which actions are valid given the
state, to come up with the action to perform. And then, given that best valid
action, we perform it, get back the next state, and proceed forward
in the loop.
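Putting the pieces together, here's a compact, self-contained toy version of that loop; every component is a stand-in for the real thing, with made-up weights and a single unlabeled entity class:

    import numpy as np

    rng = np.random.default_rng(0)
    width, n_actions = 128, 5                      # actions: B, I, L, U, O
    W_hidden = rng.normal(size=(width, 64)) * 0.01
    W_out = rng.normal(size=(64, n_actions)) * 0.01

    def embed_and_encode(tokens):
        # Stand-in for the Bloom embeddings + trigram CNN: one row per word.
        return rng.normal(size=(len(tokens), width))

    def valid_mask(inside_entity):
        # Can only continue (I) or close (L) an open entity; otherwise B, U or O.
        return np.array([0, 1, 1, 0, 0], bool) if inside_entity else np.array([1, 0, 0, 1, 1], bool)

    def parse(tokens):
        tensor = embed_and_encode(tokens)
        hidden = tensor @ W_hidden                 # pre-compute the first hidden layer once
        i, inside, actions = 0, False, []
        while i < len(tokens):                     # "not finished" = buffer not empty
            feats = [i] + ([i - 1] if i > 0 else []) + ([i + 1] if i + 1 < len(tokens) else [])
            scores = np.maximum(0, hidden[feats].sum(axis=0)) @ W_out
            scores[~valid_mask(inside)] = -np.inf  # block invalid transitions
            a = int(np.argmax(scores))
            actions.append("BILUO"[a])
            inside = a in (0, 1)                   # B or I keeps an entity open
            i += 1
        return actions

    print(parse(["Mark", "Watney", "visited", "Mars"]))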

Okay, so what's the motivation? Why do
this sort of transition-based approach, which cosmetically seems more
complicated than the tagging approach? Well, as I said, I think that
it's worth that extra complexity. Even though the model
framework is mostly equivalent to sequence tagging, and you can reason about it in
a similar way, with the sequence tagging approach
you're sort of lying when you say that it's just a tagging procedure, because it's not:
you do have to actually perform the logic of taking the tags and
matching them up into entities, and that procedure is going to happen outside of
the model and outside the formal procedure that you have.

So there are all
sorts of questions that you have about it: if I have assigned some weight to
an action that's actually invalid, how should I go about zeroing that out?
Or should I learn a gradient on it and just teach the model
not to predict it? All of this is quite unclear, because formalizing
the task as “I've just got these opaque tags” doesn't naturally capture it.
And I think that having a closer match between the task
you're actually doing and the way you formalize it is quite useful
in coming up with ways to frame the problem and optimize it. It's also very
convenient to be able to share code with the parser. The named entity recognizer
runs the same transition code as the parser, just with a slightly different set of transitions. And, as I say, you can exclude the invalid
sequences and define arbitrary features. We're still not taking full
advantage of this arbitrary feature definition capability, but I
think it's a very worthy and very nice thing to have
available once better feature functions are developed, and even without the sort of fancy or perfect features that might be developed in future, the framework is
still performing well.

Okay, so that's how spaCy's statistical
model solves the named entity recognition task. But the
prediction machinery is, I think, a relatively less important part
of the solution to named entity recognition than you might otherwise
expect. I think the thing that will actually make the most difference in
named entity recognition is making sure that you have training data that covers the
entities you're most interested in tagging. And this is also one reason why
so many developers have been so disappointed with named entity recognition
technologies: for instance, the data with which we've
trained the named entity recognition models that we're distributing is all
from 2010 at the latest, and even by the standards of named
entity recognition corpora that's relatively recent. But it very quickly
goes out of date, so one prominent error that I've noticed in the
models we've trained is that they tend to get “Trump” wrong. That's not
something you would usually get wrong if you were considering any recent news
articles, because it's such a frequent entity.

So in order to
make the best use of these things, we need ways of flexibly and
quickly updating these models and training them on data sets which cover
the entities in your domain. I think that this is really the bigger part of the solution to
named entity recognition and to making it practical for people, and this is why we've
developed a product around it as well. As I said, we need annotations, and you
need annotations that are specific to your task. The reason
is that you can definitely pre-train the embedding layer on unsupervised text, and
we can probably come up with better ways of pre-training the convolutional layer –
research on this is only just beginning, but it seems clear that
we'll be able to get good objectives for this.

We can also pre-train
entities in general, so you can have a model like the ones spaCy distributes that tags
currency entities in general, or is good at tagging countries in general, and
has a general understanding of person entities and so on. But we should
definitely fine-tune it on the entities that are in the specific text
we're dealing with. Whatever you're doing, if you're predicting a specific thing, like if you're feeding the output
of the entity recognizer forward, you definitely need to train the output
that you're producing.
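As a rough sketch of what that fine-tuning can look like with the spaCy v2 training API (the example data and iteration count are purely illustrative, and spaCy v3 later changed this API to use Example objects, so check the docs for the current recipe):

    import random
    import spacy

    nlp = spacy.load("en_core_web_sm")

    TRAIN_DATA = [
        ("Who is Mark Watney?", {"entities": [(7, 18, "PERSON")]}),
        ("The mission launches from Baikonur.", {"entities": [(26, 34, "GPE")]}),
    ]

    # Only update the NER component; leave the tagger and parser untouched.
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
    with nlp.disable_pipes(*other_pipes):
        optimizer = nlp.resume_training()
        for i in range(10):
            random.shuffle(TRAIN_DATA)
            losses = {}
            for text, annotations in TRAIN_DATA:
                nlp.update([text], [annotations], sgd=optimizer, drop=0.35, losses=losses)
            print(i, losses)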

So if you're creating text categories, you need to train an output layer on the category scheme that you're interested in. If you're
doing intent recognition, you need to train an output layer that recognizes the
intents in your domain. And no matter what
you're doing, you're going to need evaluation data. Somebody
like me can produce an evaluation within a corpus, but the question of
how well the model works on your data is separate, and that's not something
I can answer, or that anybody can answer, for you. Even if
somebody could tell you what they thought the answer should be, you would
still need to check it yourself. So no matter what you're doing, and no
matter how unsupervised your method is, you need evaluation data, and so you
always need to do at least some annotation.

And that's why I think the
annotation is very fundamental and very inescapable. The way that we think about this: we think that the annotation tooling is very important. So
in particular, Ines and I have designed a tool combining insights from
machine learning and user experience in a way that we hope helps developers train and
evaluate models faster, and the insight behind this is mainly that this works more
smoothly if people are focused on very small, simple tasks that you can do as
quickly as possible. So instead of annotating the whole structure or a lot
of information at once, giving people binary decisions I think makes the
annotation more fluid and efficient.

Alright, so now that spaCy v2 is out
and you've had this brief glimpse at how it all works – what's next? As
I said, I think that better training data, and in particular more specific
training data for more languages, genres and domains, is very important
for taking these things to the next step and making sure that they're
actually useful in practical ways. The other tasks that are sort of
adjacent to named entity recognition and that are very important are coreference resolution
and entity linking. Entity linking is this task of actually taking
the entity and resolving it into a knowledge base ID. And I think that this is
super valuable because named entity recognition kind of cuts off halfway
towards what you actually want.

The named entity is really
this nice thing because it has grounded semantics, and yet the named entity
recognition task still gives you a text label; you still just have this
opaque ball of text that you've labeled as a person. So taking that
next step and actually resolving it to a knowledge base, I think, makes things much
more useful. And in particular, with coreference resolution you can come up
with a chain of references to one thing. I think that all three of these
tasks together will enable a nice, virtuous self-training loop that will
let us keep these models up to date and extend them with more knowledge in the
future. If we can really get all of these
pieces together and working, then finally people will be able to compute with
these things in practical ways and have a really satisfying answer to
this quite foundational problem in natural language processing. So, thanks. I hope you'll get in touch and ask questions on Twitter.

There's also a
subreddit and other places where we can interact and basically talk about how these models work, what's next,
and how you can contribute to spaCy as well. Thanks!
