FAQ #1: Tips & tricks for NLP, annotation & training with Prodigy and spaCy


Hi, I'm Ines. I'm the co-founder of Explosion AI, a core developer of spaCy and the lead developer of
Prodigy, our annotation tool for machine learning and NLP. It's been really really
exciting to see Prodigy grow so much over the past year, talk to so many
people in the community, see what they're working on and discuss strategies for
creating training data for machine learning projects. Coming up with the
right strategy is often genuinely difficult and requires a lot of
experimentation. The most critical phase is the early development phase when you
try out ideas, design your label scheme and basically decide what you want to
predict. We've built Prodigy to make this process easier and to help you
iterate on your code and your data faster. In this video I'll be talking
about a few frequently asked questions that have come up on the forum. I'll also
include more details and links in the video description.

In Prodigy, there are basically two
different ways to structure your task. You can make it a binary decision and
stream in suggestions that you accept or reject, or you can make it a manual
decision and ask the annotator to decide between several options or highlight
something by hand. For example spans in a text or
bounding boxes on an image. But how do you decide which one to use? Well, there's
not always an easy answer because every problem is different. But here's a rough
rule of thumb that we found to work quite well. Manual annotation is kind of
the standard way of labelling things. You take your data and just label everything
in order. This makes a lot of sense if what you're looking for is a gold-standard dataset and if your main objective is that you need every single
example in your data labeled. Similarly, if you're creating an evaluation set, you
usually want it to be gold-standard and have no missing or unknown values. If
you're training a new category from scratch you always need to get over the
cold start problem.

Sometimes you can use tricks to make it less painful, other
times the only way is to label enough examples from scratch so you can pre-train the model. If your raw data set is very small, you probably also want to
give manual annotation a try. If there are only a few hundred texts it doesn't
make sense to use tricks here to find the most relevant examples. You can still
use some tricks to pre-select spans or labels but you don't want to be skipping
examples in favor of better ones because there's just not enough there. So these
are all scenarios where a manual interface is probably the best choice.
The binary annotation workflow is useful if you already have something and you
want to collect feedback on it. For example, you can stream in the model's
suggestions, accept or reject them and then update the model in the loop. This
is what we're doing in the active learning powered recipes, for instance. Binary
annotation is especially helpful here to improve existing categories on new data.
It also makes it easy to focus and collect information faster and it's
really designed for automating as much as possible. For example, if you want to
label whether news headlines are about sports and if so, which type of sports,
you could of course combine this all into one multiple choice question. But
depending on the task, this can really put a lot of cognitive load on the annotator
and make the process slower and more error-prone. If you're designing it as a
binary task you can go through sports versus not sports first. This is
something you can do pretty quick: sports, not sports, sports. At 2 seconds per
annotation that's almost 1,000 annotations in half an hour.

Next, you can take the sports examples and label the sports
type. That's much easier now because all you have to think about is sports
instead of sports and everything else. In summary, I would say if you can break
down your task into binary decisions, go for it. It's usually faster to annotate
and much easier to do quality control because you'll be able to compare binary
answers. If you're starting from scratch and need to get over the cold start
problem you can try and use patterns to help you pre-select entities or use the
manual interface to collect enough data so you can at least pre-train a
model effectively. The same goes for evaluation sets or if your data set is
very small.
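
The two-pass sports workflow described above can be sketched in plain Python. This is only an illustration, not Prodigy's own code: the headlines and the "SPORTS" label are made up, and the task dictionaries assume a Prodigy-style layout with "text", "label" and "answer" keys.

```python
# Hypothetical first-pass annotations: one binary decision per headline.
first_pass = [
    {"text": "Local team wins the cup final", "label": "SPORTS", "answer": "accept"},
    {"text": "Parliament passes new budget", "label": "SPORTS", "answer": "reject"},
    {"text": "Sprinter breaks national record", "label": "SPORTS", "answer": "accept"},
]


def second_pass_stream(annotations, label="SPORTS"):
    """Yield only the texts that were accepted for `label` in the first pass.

    The resulting stream is what you would annotate next with the
    finer-grained question (here: which type of sport).
    """
    for eg in annotations:
        if eg["label"] == label and eg["answer"] == "accept":
            yield {"text": eg["text"]}


sports_only = list(second_pass_stream(first_pass))
for task in sports_only:
    print(task["text"])
```

The second pass only ever sees headlines that are already known to be about sports, which is what keeps the follow-up question simple for the annotator.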

If you've used an active learning
powered workflow to improve a named entity recognition model you've probably
come across a situation like this. In order to help you find the most relevant
examples for training, Prodigy will get all possible analyses of a sentence and
suggest you the entity spans that the model is most uncertain about. You can
then accept it if it's correct or reject it if it's wrong. But what if it's
half correct? For example, in this case we want to improve the model's "PRODUCT"
category. "iPhone" is suggested as a product entity and it's definitely a
product entity we want to recognize in general.

But here the full span would be
"iPhone X". So should we accept this because iPhone is a product? This is one
of the questions where the answer is pretty straightforward: no, this is
something you should definitely reject. Keep in mind that the feedback you're
giving is always on that particular span in that particular context. So if we
accept "iPhone", the feedback the model gets is: "Yes, this was the perfect parse
in this context please produce more like this" which is obviously not what we want.
If we reject the span it's not like we're telling the model "No, iPhone is
never a product entity". We're telling it "In this context, the analysis where
only the token iPhone is labeled as a product is incorrect, please produce less
of this and try again." So we're basically reinforcing other
more confident analyses of that sentence, for instance one that has "iPhone X"
labeled as the product.
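
To make the difference between the two analyses concrete, they can be written out as one tag per token. The sketch below is plain Python, not spaCy's internal code, and uses the BILUO scheme whose tag letters are spelled out in the next paragraph. The sentence is a hypothetical stand-in for the one shown in the video, and the helper name `biluo_tags` is made up for this example.

```python
def biluo_tags(tokens, spans):
    """Convert token-index entity spans into one BILUO tag per token.

    spans: list of (start, end, label) tuples, with `end` exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        if end - start == 1:
            # Single-token entity
            tags[start] = f"U-{label}"
        else:
            tags[start] = f"B-{label}"
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"
            tags[end - 1] = f"L-{label}"
    return tags


# Hypothetical example sentence, already split into tokens.
tokens = ["I", "bought", "an", "iPhone", "X"]

suggested = biluo_tags(tokens, [(3, 4, "PRODUCT")])  # only "iPhone"
preferred = biluo_tags(tokens, [(3, 5, "PRODUCT")])  # "iPhone X"

print(suggested)
print(preferred)
```

The two tag sequences differ on both of the last two tokens, so accepting the shorter span would reward an analysis that is wrong in two places, not just slightly incomplete.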

It might help to look at the analysis of the sentence in
the BILUO scheme, which is pretty much how named entities are represented
internally. "B" stands for beginning of an entity, "I" stands for inside an entity, "L"
stands for last token of an entity, "U" stands for unit – so basically single
token entities – and "O" stands for outside an entity. For these two analyses
here the BILUO tags are different so we want to be rejecting the analysis of the
tokens "iPhone X" as "U-PRODUCT" "O", update the model accordingly and move
the analysis towards "B-PRODUCT" "L-PRODUCT". I know that rejecting things
that are almost correct isn't always easy. As you train your model in the loop you can
sometimes become a little attached to it and you want to reward it for almost
getting it right. But you really have to stay tough.

Another question that sometimes comes up
is: Should I reject or skip? Prodigy lets you perform three main actions: accept,
reject and skip.
Those will be added as the "answer" key to the created annotation
task. If you skip an example it will be excluded from pretty much everything. If
you train a model on the created data set or use it as an evaluation set, skipped
examples will be filtered out. So when would you want to do this? The skip
action is really mostly intended for very specific examples that shouldn't
even be there. For instance, if you're annotating comments straight from the
web you might end up with broken markup or one sentence in a different language.
Sometimes you also have examples that are confusing and difficult, so instead
of spending a long time thinking about it, it's better to just ignore it and
move on. That's especially true if you have lots of raw data you really don't
want to lose momentum on one single stupid example. That said, if you choose
to ignore examples based on some objective – for example, broken markup or
the wrong language – you also shouldn't be evaluating against a set with those
types of examples in it.
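
The filtering of skipped examples mentioned above can be sketched like this. It's a minimal illustration in plain Python, assuming that each annotation is a dictionary with an "answer" key and that a skipped example is stored with the value "ignore". The example texts and the helper name `drop_skipped` are made up.

```python
def drop_skipped(examples):
    """Keep only examples that were explicitly accepted or rejected."""
    return [eg for eg in examples if eg.get("answer") in ("accept", "reject")]


# Hypothetical annotations exported from a dataset.
annotations = [
    {"text": "Great phone, love the camera", "answer": "accept"},
    {"text": "<div><span>", "answer": "ignore"},  # broken markup, skipped
    {"text": "Totally unrelated comment", "answer": "reject"},
]

usable = drop_skipped(annotations)
print(len(usable), "of", len(annotations), "examples kept")
```

Applying the same filter to both the training and the evaluation data keeps the two consistent, which is the point made above about not evaluating on example types you chose to ignore.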

And if your runtime model needs to be able to
handle certain types of texts, those should be present in both your training
and your evaluation set. To give you an example: if your model needs to deal with
tweets in real time as they come in, you really want it to be trained on a
representative selection. If you filter out all the noise during annotation your
model never actually gets to see it and will likely perform pretty badly on it
or get very confused. By the way, one tip we often give people who are dealing
with lots of messy and noisy data: experiment with chaining together two
classifiers. Start by training one on a very simple binary distinction – not noise
versus noise. For example tweets with actual content you want to analyze
versus tweets consisting of an emoji, a link or just spam. Next train a
classifier for your actual objective that only runs on the data filtered by
the previous classifier. This is also much easier to annotate. You will only
have to assign your labels to the filtered set of texts that actually
matter and not reject hundreds of examples that are just noise.

Prodigy's interface is really designed
for smaller chunks of text like sentences or single paragraphs. The built-in recipes
will also try to split longer texts into sentences wherever possible.
But what if you're annotating named entities and you or your annotators just
need the previous paragraphs to decide if a label applies or not? Well, the thing
is, if you're doing named entity recognition, the model is actually
looking at the very local context, which means the surrounding words on each side
of the token.

So as a rule of thumb, if you or your annotators are not able to
make the decision based on the local context, the model is unlikely to be able
to reproduce that decision. You definitely want to find out about things
like that as early as possible in your experimentation phase when you click
through a few examples yourself and try it out. You don't want to ask someone to
label thousands of documents only to find out later that your model isn't
actually able to learn any of the entity types you've come up with. Labeling at
the sentence level is always a good sanity check in that way. If you're doing
long text classification that's a little different because here your end goal is
to predict labels for the whole text. But still, most implementations for long text
classification usually predict those categories by averaging over the
predictions for smaller chunks like sentences. So when you're labeling data
for long text classification you might as well label it all at the sentence
level or paragraph level.
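
The averaging idea can be sketched in a few lines of plain Python. This is an illustration of the general approach rather than the code of any particular library. The per-sentence scores are invented, and `average_scores` is a hypothetical helper name.

```python
from collections import defaultdict


def average_scores(sentence_scores):
    """Average per-sentence category scores into document-level scores."""
    if not sentence_scores:
        return {}
    totals = defaultdict(float)
    for scores in sentence_scores:
        for label, score in scores.items():
            totals[label] += score
    n = len(sentence_scores)
    return {label: total / n for label, total in totals.items()}


# Hypothetical classifier output for the three sentences of one document.
sentence_scores = [
    {"SPORTS": 0.9, "POLITICS": 0.1},
    {"SPORTS": 0.6, "POLITICS": 0.2},
    {"SPORTS": 0.3, "POLITICS": 0.3},
]

doc_scores = average_scores(sentence_scores)
print(doc_scores)
```

Because the document-level score is built from sentence-level predictions anyway, sentence-level labels give the model a training signal at the same granularity it predicts at.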

Your annotators will be able to focus better, produce
higher-quality data and give you a lot more to work with: one label per sentence
at about the same cost. If you want, you can even experiment with automation and
pre-select the sentences with a higher information density. We can do all of
that stuff pretty well with NLP so there's really no need to waste the
human's time by asking them to read through thousands of irrelevant filler
sentences. Essentially, what I'm trying to say is, when designing your annotation
tasks you don't always need to annotate exactly what you want your runtime
system to output.

What matters is how you break down a larger goal into smaller
solvable machine learning tasks. And the shorter the dependencies you're trying to
predict, the easier it usually is for the model to learn them.

When you start a new NLP project you often need to decide: Do I start off with
a pre-trained model and fine-tune it, or should I train a new model from scratch?
Prodigy lets you implement workflows for both scenarios. But how do you decide
which one to go for? Fine-tuning pre-trained models is especially useful if
you need to predict the same categories and just want to improve accuracy on new
specific data. You'll need much less training data – sometimes even a few
hundred binary decisions can have a big impact. That's like 10 minutes of
annotation with Prodigy. But there's also a downside to using pre-trained models,
because by definition, they come with pre-trained weights. And whatever you do, you
always need to manage the existing weights. Every update you make interacts
with what's already there, often trained on millions of words.

This can sometimes
lead to very confusing results. One common problem is what's often referred
to as "catastrophic forgetting". If you update a pre-trained model with examples of a
new category but you don't include examples of what it previously predicted, it
may "forget" what it previously learned. There are ways to prevent this but it's
something you always have to think about and design around. You might spend a lot
of time hacking around arbitrary side-effects of the existing weights and
maybe that time could be better spent creating a new dataset to train a model
from scratch. Training a model from scratch requires a lot more data but it
also gives you a somewhat blank canvas to start from. You can define your very
own label scheme or use your very own category definitions without having to
consider the existing weights.

So if your labels and their definitions are very custom
and far off from any generic pre-trained model, you should consider training from
scratch. If you have enough raw unlabeled text, you can still use some automation
to speed up the labeling process. For example, you can stream in texts that are
already labeled by an existing pre-trained model and then extend those labels
manually. If you want to train your own named entity recognition model from
scratch but you do want to include the label "PERSON", you can have spaCy label
that part for you. Even if the model is only correct 70% of the time, hey, that's
70% less manual labeling work you have to do. Prodigy implements a workflow like
this in the "ner.make-gold" recipe by the way.

Finally, if you've been reading
some of my comments on the forum, you might have seen me make this point
before. But I honestly think that one of the most powerful but underutilized NLP
techniques is combining generic statistical models with application-specific rules to extract more complex relationships.
A simple example of this is the following: the corpus spaCy's English
models were trained on defines a person as just a
person name, so without any titles like "Mr" or "Dr".

This makes sense because it
makes it easy to resolve those entities back to a knowledge base. But what if you
need the titles? Trying to fine-tune the model to completely change its
definition of "PERSON" is probably going to be a very painful process. All its
weights are based on that definition and you probably need a lot of data to
change that. However, syntactically, there's one thing
all of these titles have in common, at least in English and similar languages.
They come right before the person name and there's a limited number of options.
So to check for the titles, we can take a predicted person entity span and look at
the previous token, the previous two or maybe the previous three. This lets us
capture "Mr", "Prof Dr" or even "Prof Dr Dr", which is actually
surprisingly common in Germany where people are really into titles.
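
A minimal sketch of that title check is shown below. It is not the code from the video. To keep it runnable without a trained model, it works on a plain list of token strings and (start, end, label) token spans. With spaCy, the spans would come from the predicted entities in `doc.ents`. The title list, the example sentence and the function name `expand_person_spans` are assumptions made for this illustration.

```python
# Illustrative, incomplete list of title tokens (an assumption for this sketch).
TITLES = {"Mr", "Mr.", "Ms", "Ms.", "Mrs", "Mrs.", "Dr", "Dr.", "Prof", "Prof."}


def expand_person_spans(tokens, spans, max_titles=3):
    """Extend PERSON spans to the left over up to `max_titles` title tokens.

    tokens: list of token strings
    spans: list of (start, end, label) token-index spans, `end` exclusive
    """
    expanded = []
    for start, end, label in spans:
        new_start = start
        if label == "PERSON":
            # Walk left while the previous token is a known title.
            while (
                new_start > 0
                and start - new_start < max_titles
                and tokens[new_start - 1] in TITLES
            ):
                new_start -= 1
        expanded.append((new_start, end, label))
    return expanded


# Hypothetical sentence with entity spans as a model might predict them.
tokens = ["We", "met", "Prof", "Dr", "Dr", "Erika", "Mustermann", "in", "Berlin"]
spans = [(5, 7, "PERSON"), (8, 9, "GPE")]

result = expand_person_spans(tokens, spans)
for start, end, label in result:
    print(label, tokens[start:end])
```

The sketch replaces the original span with the expanded one. Storing the title tokens separately, for example in a custom attribute, would work the same way by keeping `tokens[new_start:start]` instead of moving the span boundary. It also does not check whether an expanded span would overlap another entity, which a real pipeline would need to handle.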

So in code,
the whole thing could look like this. We can expand the entity selection to
include the title tokens or add them as custom extension attributes which we can
retrieve later on. This example might seem quite basic, but there's actually a lot
more complex stuff that you can do in a similar way. The part-of-speech tags and
dependency parse hold so much information that you can use to go from generic
labels to specific structured information. For more examples and ideas
check out the links in the video description. I hope you enjoyed this
video! Thanks for using Prodigy and for all the feedback and great NLP
discussions on the forum. If you haven't seen it yet, also check out our
prodigy-recipes repo on GitHub that includes a collection of recipe scripts for
various different annotation workflows.

They're also great starter recipes if
you're looking to build your very own custom pipelines. If you want to see
another video on different questions let us know on Twitter!

As found on YouTube