
Fall 2019 NLP Seminar: Nathan Schneider & Vivek Srikumar


[LAUGHTER] And OK. Good afternoon. And welcome to, I believe,
the first NLP talk of the autumn quarter, 2019. We're live streaming thanks
to some very nimble help from computer science
and engineering support. We're broadcasting live. I'm delighted today to
introduce two speakers. I'll introduce the first one. We'll hear the talk. And then, after questions, we'll
move to the second speaker. And I'll introduce
him when it's time, assuming his slides are
ready when it's time for him. [LAUGHTER] OK. No pressure. So first up today, we
have Nathan Schneider, who is an assistant
professor in computer science and linguistics at Georgetown
University, where he's been for about three years. Prior to that, he was a postdoc
at the University of Edinburgh working with Mark Steedman. And prior to that, he was a
PhD student working in my group back when we were
at Carnegie Mellon. And he– Nathan is known for a
wide range of exciting research in the area of broad coverage
natural language semantics. He is very much at the
linguistics extreme of computational
linguistics in NLP and has contributed new data
sets and lots of new thinking about what we ought to be
modeling in addition to models themselves.

One of the things that makes
him sort of unprecedented in my group was while
he was a PhD student, he was essentially running
his own annotation shop and building new data sets and
managing a team of linguists to create, in
particular, new data sets around multi-word
expressions, which formed the core of his thesis. So if you're a student
working in NLP, don't let your student
status stop you from being innovative
about the data. To me, that has set
a nice precedent. So he's going to
talk to us today about preposition semantics. And with that, take it away. All right. Thanks, Noah. And thanks to all of
you for being here. I'm thrilled to be here. This is my first time in this
building since 2008, I think. So it's great to be back and
to have such a great group around NLP. This is kind of one of the
epicenters of NLP in the US. And so I'm looking forward
to sharing some of the stuff that I'm working on
and feedback, as well.

So to just give the general
background, fitting with what Noah started off, the sorts
of questions I'm interested in and people in my
lab at Georgetown are generally under the umbrella
of working with linguistically inspired analyses of text– so in particular, what
linguistically inspired analyses can we obtain from both
human annotators and machines with machine
learning and so forth with a focus on
making these analyses accurate, robust,
efficient, comprehensive, and covering text
corpora ideally in lots of domains and
languages around the world? So to set the
background a little bit here, we can ask, what
does meaning look like? Or what should it
look like if we want to get systems that
understand meaning and can communicate with
human style language? And of course, the sort of
de facto approach in our field right now is that, to a large extent, we sort of express meaning
with some sort of vector representations or
matrices or tensors.

And these are learned in a
complicated neural network. And they are very good at
capturing certain regularities in text. But they are also really
hard to understand as a human what sorts
of generalizations they are capturing. But of course, there is a huge
tradition in computational linguistics and NLP of thinking
about meaning in other ways, whether it's some sort of
relationship between sentence pairs, whether it's some sort
of graph-based representation of predicate argument
structure and stuff like that– so this is an example of
AMR, the abstract meaning representation– or whether it's some sort of
more formal logical semantics that has scope and
stuff like that. And so these are all sorts of
things that are on the table and that I think are
valuable research directions. And my own perspective is that
it's not like one of these is the correct way, and
the others are incorrect. But rather, they have
different design goals and are targeting sort of
different aspects of meaning.

Meaning is– I like to think of
meaning as a really rich cake that has lots of layers. And we can think about
the more lexical aspects of meaning maybe in
a very abstract way that sort of generalizes
well across languages. Or we could think about
more specific, like, lexicon entries,
like the difference between a dog and a cat or
more abstract nouns and verbs and so forth,
where we might need language-specific
lexical resources. Or we may think about sort of
more structural aspects of how meanings come together, semantic
relations that show up in text. And this could be also
tied to a lexicon. And then, at the top
of sort of semantics, the most detailed representation perhaps is some sort of formal logic that you can do inference on in a formal way– symbolic inference. And then, of course,
there are other dimensions of meaning that are
not simply semantics, but involve pragmatics, world
knowledge, social meaning, and so forth. So my general
perspective is that we need to look at different
layers of this cake and try to understand how the
layers fit together or could be made to be more
interoperable and how they relate to things like
neural language models.

OK. So the particular
focus of this talk is going to be on adpositions. Adposition is a term in
linguistics that covers prepositions and postpositions. So prepositions
are things we have in English, like "in,"
"on," "at," "by," "for." These are these words that help
connect two things together. Usually, prepositions are
followed by a noun or noun phrase.

And you know, other
languages have prepositions, like French and German. But languages, like Korean
and Japanese and Hindi Urdu have postpositions. And so that just means it
comes after the noun phrase. This talk is mostly going to be
about prepositions in English, but then adposition is a more
general term when I start talking about other languages. And we can look at the
typological surveys of the languages of the world. And we see that there are
some interesting geographic distributions of prepositions
versus postpositions. The red ones– the red dots
are prepositional languages. So you see most of the languages
of Europe, sub-Saharan Africa, I guess, and Polynesia. And you see concentrations
of postpositional languages in different areas of the world. So this whole study– so prepositions are
really fundamental, it seems, or adpositions are
really fundamental to the way grammatical structure and
meaning gets put together in many languages or
if not most languages.

So I'm going to be talking about
a project focusing on the meaning of
adpositions that started as a collaboration
between myself and Vivek and Jena Hwang, who's now at AI2. And it's not a coincidence that
Vivek and I are here today. We wanted to coordinate our
visit to meet with Jena and hash out some new
directions in this project. But we could not
have done it alone. There have been many, many, many
collaborators over the years. We started working
on this in 2014. And we're still working
on it, because adpositions are so darn complicated. And so we have gained
many collaborators from places like the
University of Colorado Boulder and my own students
at Georgetown and other people, who have been
involved in different aspects of this project. OK. So why should we care
about preposition semantics or adposition semantics? So when I go to a conference,
like ACL, and I mention– you know, someone says,
what are you working on? And I say, I work on
preposition semantics.

I get one of two responses. If I'm talking to a computer
scientist, I get a– [LAUGHTER] –sort of a blank stare often. Like, what? I thought prepositions
didn't have meaning. They're like stop words. We throw them out. Like, what– you know, why
should we– why should– why does this even make sense? On the other hand, if
I'm talking to someone with a lot of
linguistic training, I get sort of a look of pity– [LAUGHTER] –because linguists know
that adpositions are just incredibly complicated and
enmeshed in all kinds of parts of grammar and different
from language to language. So it's a really
challenging topic, but also relevant to NLP. And so the best argument I
found for why we should care is this little cartoon that
somebody put on the internet. This is Wilbur. And he has a sign saying,
"will work as food." And the caption is, due
to his grammar mistake, Wilbur found a position.

It just wasn't
the one he wanted. So the culprit here is
this little word "as." And maybe Wilbur was not a
native speaker of English. Maybe his native
language was Pig Latin. [LAUGHTER] But he clearly meant to
say "will work for food." And he said he will
work as food. And things ended tragically
for Wilbur because of this. So this shows that this little
word has an important meaning and at least a very
different interpretation of this sentence. But it's not just,
you know, cartoons. We can also find examples
of real language, where prepositions are
ambiguous, even to readers.

So I saw this the other day
and was like, what's going on? So Senator Dianne Feinstein
laying the groundwork to sue DOJ for release of
the whistleblower report. So DOJ is the
Department of Justice. She is suing for release of
the whistleblower report. So does this mean–
who thinks this means she wants the release
of the whistleblower report? And who thinks this means she is
suing because somebody released the whistleblower report? OK. So some people answered both. So I was surprised to see this,
because my initial reaction was the second one, just from
reading the sentence that you sue someone because
of something they did, like if you are
challenging what they did. But the actual intended
reading from context is that she wants
the whistleblower report to be released. And the government is– DOJ is withholding it.

So this shows an important
way that this little word is– leads to different outcomes and
maybe different interpretation of the sentiment of or opinion
of Dianne Feinstein with regard to release of the whistleblower
report and so forth. So we want ways to handle this–
to formalize this and handle this in NLP. Here's– I couldn't resist,
given what day it is today. This one, also with "for"– How to Actually
Apologize for Yom Kippur. [LAUGHTER] This does not mean I apologize
for the existence of Yom Kippur. It means apologize to people
on the occasion of Yom Kippur, because that's what you're
supposed to do on Yom Kippur. And then, finally this one– [LAUGHTER] –smoke detector installed on– and somebody filled in
"the ceiling," somebody being a bit of a smart aleck. So these are all
ambiguities of prepositions that are very clear to
us as human readers.

And of course, many
of the things that– most of the time, we
don't even notice this, because we have such good
world knowledge that we're able to disambiguate as humans. Another reason that
I think prepositions are really interesting is that
they're extremely frequent. Something like 10%
of tokens in text will be prepositions in English. This is a word cloud
distribution for English words. And you see function words, you
know, are extremely frequent. And then, the red ones
are all prepositions. So this is– if you
could better account for the meanings
of these things, you would be covering more
of the data, a lot more of the data. But the challenge here is
that with great frequency comes great polysemy. So you take a word like "for." You already saw a
few of the meanings. It can be leave for Paris, where
it's somewhere you're going. Eat for hours– all
these other meanings. And we can give them sort
of descriptive labels and with many very
similar meanings are expressed by other
prepositions like "to" and "on."

And maybe you have to change
a little bit of the context. But you get a similar meaning. So our goal is going
to be to come up with a set of these labels
that we have definitions for, and we can apply to disambiguate
prepositions in English. So we can characterize
these ambiguities. So with "will work
as food," the reason that it gets a funny
interpretation is that "as" here has an identity
sort of meaning, meaning Wilbur is being equated
with food versus the intended sentence was "will work for
food," where food is somehow related to the purpose of work.

The "installed on the
ceiling" example– we're talking about
a locus or location versus a time, which was
probably the intended use of that blank line. And with the suing for a release
of the report, it's a purpose. Do you want the
report to be released versus an explanation–
why are you suing, or what was done that you
are suing in response to? OK. So the challenge here in coming
up with a set of labels– you know, it's pretty
natural to say, you know, a location, time, purpose,
et cetera in principle. But prepositions are not
just about space and time, even though that's
sort of– they're sort of rooted in
space and time. But then, they get extended to
all kinds of other meanings. And so pretty much
any kind of, you know, almost any kind of semantic
relationship between two entities or an entity and
an event or two events can be expressed, at least in
English, with a preposition. So things like
causality and intention and comparison and
emotion and all this stuff involve prepositions
in some cases.

Why is this interesting for NLP? Well, I've showed you– I've sketched out a
disambiguation task. We know from syntactic parsing
that PP attachment is highly– is a difficult part of that. So maybe if we had
semantics, we could do better at that or vice versa. The– if you want to build
some sort of explicit meaning representation, like
a graph-based meaning representation,
prepositions are going to be an important signal
in the training data. And I'm especially interested
in these, because they're sort of in between the lexicon and the grammar of the language. And they help you see
important enough to be grammaticalized, to be
really tied to different– very productive uses. We also might want these for
working with second language data, helping second
language learners.

Here was an example I saw
in a Starbucks in Korea. It was translated as "seat
available on upstairs." And while this is
perfectly understandable, it's not– it's clearly not
the way a native speaker would express this. So there's all
kinds of subtleties in the way prepositions are
used within a particular– or adpositions are used
within a particular language. And they're very difficult
to learn as a second language speaker. Relatedly, machine
translation makes a lot of preposition errors. And so maybe we should use
some sort of explicit semantics there, as well. OK. So I'm going to talk
about an approach that we published at ACL 2018. And this focuses on three
different dimensions. First, we had to develop
a descriptive theory of, what are these labels that
we're going to assign? How do we define them? How do we train
annotators to apply them? Secondly, we created
a data set, where all of the types and
tokens of prepositions have been annotated with
these semantic labels. And finally, we built
some disambiguation– we did some experiments
with disambiguation. So I should first
acknowledge that we were not the first to think about
preposition meanings in a computational setting.

This sort of
groundbreaking work here was done by Ken Litkowski
and Orin Hargraves– they built a large lexicon of very fine-grained preposition senses that
they applied to corpora. And there were some shared tasks
on disambiguation and so forth. There's also literature
in linguistics, especially cognitive linguistics
on how these preposition meanings can get
extended, and new meanings can be added based
on the old ones. There is also work on general
purpose semantic relations. And this is a little bit
closer to the way we approach the problem: we're coming up with a fixed set of labels that we are trying to disambiguate into– labels that have human-interpretable names, rather than being very specific to each preposition. So there was some work
of these class-based approaches that is comprehensive with
respect to tokens and types, meaning if I give you
a sentence, I want to– or a [INAUDIBLE] I
want to the system that tags all of
the prepositions with some sort of
semantic label, annotate all the prepositions. OK. So our approach is with
these coarse-grained labels that I've already
started to show you.

We call them
supersenses– supersense meaning higher
level, more abstract than a traditional sense. And so they look like this. So if you say, "the
cat is on the mat in the kitchen on a
Sunday in the afternoon," we're going to cluster together
these two as locus or location and these two as time. OK. And I want to be very clear– we're not capturing the full meaning of these prepositions. You can't swap them– you can't say the cat is in the mat on the kitchen. These are not interchangeable. But in these contexts, the first two have a very clear locational meaning, and the other two have the time meaning. We are comprehensive. We want to tag all
of the instances. And we also include
possessives, because "of" is a preposition grammatically. Apostrophe s here is a possessive marker or a genitive marker. But they alternate a lot. And they cover a lot of
the same semantic ground. So we also include
these possessives. So pages of the book
or the book's pages– these are part-whole relations. And there's a technical aspect,
which is that in some cases, we assign two labels to
a token to distinguish the sort of semantic
role with respect to a scene and the more
lexical information that the preposition
is signaling.

I'm not going to
get into this today. But look at the paper
if you're interested. How am I doing on time? SPEAKER: 12:30. NATHAN SCHNEIDER: 12:30– OK. I'm going to try to move
a little faster now. So here is– after many
years of back and forth, here is our current inventory. It has 50 of these labels that we are going to use
to disambiguate with. I'm not going to give you even a
tiny bit of the gory linguistic work that had to go into
this with lots of discussions and weekly conference
calls since 2014. But the gist of this is that we call this the Semantic Network of Adposition and Case Supersenses. It's this hierarchical structure– SNACS for short. So you see that it has these
three different regions. One is called
circumstance for things that often modify events. So like, place and time
and means and manner and purpose and stuff like
that tend to modify events. Then, we have participant,
which tends to be core arguments and events, like agent and theme
and stimulus and experience and recipient.

And then, we have
a configuration, which tends to be
for things that are relations between nominals. And so traditionally,
these were– they were sort of the
semantic role labeling literature which dealt with
the first two categories and then the relations
between nominals literature, which dealt with the last one. So we had to do some work
to bring them into unity. Here are some annotation
procedure details. We started out with
University of Colorado Boulder students, who already had done
some linguistic annotation. We trained them
with our guidelines. We improved the guidelines
as we were going. We measured interannotated
agreement over time.
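As a rough illustration only (a minimal sketch, not their evaluation setup; the labels and annotator pairing here are hypothetical), raw token-level agreement on a doubly annotated sample can be computed like this:

```python
def raw_agreement(labels_a, labels_b):
    """Fraction of preposition tokens on which two annotators chose the same supersense."""
    assert len(labels_a) == len(labels_b)
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)

# Hypothetical annotations of the same five preposition tokens.
annotator_1 = ["Locus", "Time", "Purpose", "Possessor", "Theme"]
annotator_2 = ["Locus", "Time", "Beneficiary", "Possessor", "Theme"]
print(raw_agreement(annotator_1, annotator_2))  # 0.8
```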

And it was in the 60%'s and
70%'s range at that point. You know, as the
guidelines got better, our agreement went
up and so forth. Also, this was with
our older hierarchy, which had 75 categories. So we hope the new
one is a lot better. But the upshot is we have a
[INAUDIBLE] data set called STREUSLE, for Supersense-Tagged Repository of English with a Unified Semantics for Lexical Expressions. For those of you
who may not know, this is a picture of
a streusel, spelled "-el," or [INAUDIBLE]
in German, which is a cinnamon sugar
topping on baked goods. And this is because lexical
semantics is delicious. [LAUGHTER] I keep bringing up– I don't know. So this is annotated on
top of online reviews, so from the English
Web Treebank. So it also has gold
universal dependencies.

But we've added the
semantic annotations. And in my dissertation,
which [INAUDIBLE] alluded to, we did multi-word
expression annotations. And then on top of that,
we added some noun and verb supersenses. And relevant to this
work, we added preposition and possessive supersenses. So it's a small corpus, but it's
been very carefully annotated. And you can download
it right here. And it's still
undergoing improvements as we refine the scheme. So some actual examples– three weeks ago, burglars
tried to gain entry into the rear of my home. We say that gain entry
is a multi-word verb. And we want to annotate "ago," "into," "of," and "my." And this is reasonably
straightforward.
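As a purely illustrative sketch (this is not the actual STREUSLE file format, and the label names here just follow the descriptions in the talk), the annotated targets in that sentence might be represented like this:

```python
# Hypothetical rendering of the adposition and possessive targets in
# "Three weeks ago, burglars tried to gain entry into the rear of my home."
annotations = [
    {"target": "ago",  "supersense": "Time"},       # three weeks ago
    {"target": "into", "supersense": "Goal"},       # gain entry into the rear
    {"target": "of",   "supersense": "Whole"},      # rear of my home (part-whole)
    {"target": "my",   "supersense": "Possessor"},  # my home
]
```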

So "ago" represents a time. "Into" represents a
goal or a destination. "Rear of my home" is
a part-whole relation. And "my" indicates
possession of home. But we frequently
encounter difficult cases, like, somebody provided
us with excellent service. So what is this "with"? Came with a great deal of
knowledge and professionalism– what is this "with"? We call this one theme and
this one characteristic, but there's a whole lot of
details into how we came around to that. And this "of"
represents a quantity. Our guidelines currently
stand at 85 pages. So this is not an easy
annotation task, not something we could just
directly crowd-source. But luckily, luckily
for me, I have a lot of linguistics
students who like to do this kind of annotation. But in the future, we would very
much like to find, you know, which parts of this
annotation are simple enough that we could do it
on a larger scale, and then have experts only
work on a small subset.

I said that prepositions are
not just about space and time. Here is a pie chart of– spatial and temporal
are only about a third of the tokens in our data. Now, remember, this
is online reviews, so people are talking
about their experiences at restaurants. If they were talking
about soccer matches, maybe there would be
more space and time. But getting this rest of the pie
is actually really important. Here are some of the types of
both single-word and multi-word prepositions. And, of course,
deciding what should qualify as a multi-word
preposition is challenging. OK, audience participation–
what do you think is the percentage of the
token distribution represented by the top 10 types? So if you take the top 10
most frequent prepositions, what percentage of all the
prepositions do they represent? So who thinks it's
between, I don't know– who thinks it's under 50%? Who thinks it's
between 50% and 75%? Who thinks it's
between 75% and 100%? OK, so the results are here.

So if you look at the top 10, it's really about 75%. Very extreme long tail
here, which is fine. But if we look instead
at the semantic labels, their distribution is
a little more even. I mean, we still have some low-frequency ones.
more spaced out, which suggests
that this is really something about the language,
not about the meanings that are being conveyed. We also, as I mentioned,
include possessives. There was some
previous work on this. We are able to
capture alternations, like the pages of
the book, the book's pages, the murder of the boy,
the boy's murder, and so forth. And this can actually– this data set, we hope, will be
used for more linguistic study of this alternation.

I should also
mention that we have also started annotating The Little Prince in English. And we have a more recent
estimate of interannotator agreement, which is
around 78%, depending on how you measure which pair
of annotators and so forth. So we think this
is close to being a pretty good level of agreement
for a 50-way disambiguation task. But we're always
trying to improve this. OK, so now a few quick
slides about disambiguation. This was in 2018. So we, of course, have a
most frequent baseline. And then we did some very
standard classification techniques, as of 2018, which,
of course, has now all changed. We did a BiLSTM with a
multi-layer perceptron classifier. And we also did an
SVM with features based on Vivek's previous work. And the important– what
we're wondering here is just what is sort of
the baseline difficulty of this task if you
use machine learning? And here are some
accuracies using– assuming we know which
things are supposed to be considered
prepositions, and we have gold syntax features,
at least for the SVM model.
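To make that baseline concrete, here is a minimal sketch of a most-frequent-supersense baseline (the data structures are hypothetical, not the code from the paper):

```python
from collections import Counter, defaultdict

def train_most_frequent(train_tokens):
    """train_tokens: (preposition, supersense) pairs from the training data."""
    counts = defaultdict(Counter)
    for prep, label in train_tokens:
        counts[prep.lower()][label] += 1
    return {prep: c.most_common(1)[0][0] for prep, c in counts.items()}

def predict(model, prep, fallback="Locus"):
    # Every token of a preposition gets that preposition's most frequent training label.
    return model.get(prep.lower(), fallback)

model = train_most_frequent([("for", "Purpose"), ("for", "Beneficiary"),
                             ("for", "Purpose"), ("in", "Locus")])
print(predict(model, "for"))  # Purpose
```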

So the most frequent baseline
is actually quite low. You don't have to worry about the difference between these three columns. But basically, the most
frequent baseline is very low. And this shows that
these prepositions are, in fact, very ambiguous. It's not just that the
most frequent sense is 90% of the
distribution or something. If we use some machine
learning, it helps a lot. The context matters a lot. So here's our neural
system, 70s or low 80s, depending on which
measure you're using. And here is our SVM, using
features, about the same. We looked at the frequent errors. And it was doing things
like overpredicting the most frequent
label, and also not getting the nuances right
between some of these closely related categories,
like SocialRel is for relations between
individuals versus relations between an individual
and an organization, like, oh, you work
for a company.

And also these labels,
which is sort of like more concrete
possession, literal ownership versus more
abstract forms of possession. Fortunately, some
people from this area came along and applied
contextualized word embeddings this year. So this paper, they used
our data set and evaluated. Assuming you knew which things
should have preposition labels, they improved
on our 60s and 70s numbers to get up to 80 to 90. So there is a lot of
improvement you can get from contextualized embeddings. But GloVe by itself
works terribly, for some reason, presumably
because these things are so ambiguous. So just having a
non-contextualized representation is not going
to buy you a whole lot. Actually predicting
which things should be treated as prepositions
is not trivial, though, and because some of them
are multiword expressions, some of them are
idiomatic, some of them are ambiguous with
other parts of speech. So we're now working– I'm working with Nelson
and my student, Michael, and some other people on
trying to build a full sentence tagger that does all of this. Another interesting paper I
want to give a shout-out to is from *SEM this year. And this did not
use our data set.

But they looked at
different classes of function words in
natural language inference, and different ways
of pretraining, and how this affects whether
the natural language inference systems are reasoning correctly
about these function words. And they found that pretraining
matters a lot for prepositions. If you pretrain based
on image captioning, it will get some of
the concrete senses but not the abstract senses. OK, with the maybe
5 minutes, or I guess I should
wrap up very soon. But just a quick flavor
of some of the extensions we're looking at
to other languages. This is called
Case and Adposition Representation for Multi-Lingual
Semantics, or CARMLS. These are the
languages we've been looking at so far,
the prepositional ones and the postpositional ones.

To give you some intuition
here, if you look at a, like, "for" versus
"to" in English and compare that
with French, there are three major prepositions in
French that translate as "to" or "for" a lot. And maybe if we have
semantic labels, we can distinguish
these different usages better than not. So this is kind of
the vision, and this is why we're
interested in looking at many different languages. Of course, we have to
define exactly what counts as an adposition or a case
marker in these languages, and deal with morphology, and
compounding, and all kinds of interesting stuff. But the most
interesting question is whether these semantic
labels that we have really do generalize to other languages. We tried to make
them fairly general. So a couple examples
of interesting things we've encountered. In Mandarin Chinese, we have an
interesting system where there are sort of
prepositiony sort of, I think these are
called coverbs that can interact with sort of
another modifier that's also prepositiony, which
are called localizers.

And so we had to decide
how to analyze these. So you say something
like "at" more generally– there's an at-like relation– and then you refine it by saying "on top of." I don't speak Mandarin. But this is one
interesting case. And some of the
students at Georgetown have actually annotated
the full Little Prince with the supersenses. And so we can start to look at
cross-linguistic differences. For German, there's a
really fun phenomenon where German has
morphological case on nouns. It turns out there's
an interaction between the preposition you use
and the case of the object noun phrase in terms of the meaning. So if you want to
talk about, if you use "in" as the preposition,
you use the dative case for the noun if you are talking
about a static location, and the accusative case if
you're talking about a goal. So "in dem Auto" is in the car. "In das Auto" is into the car.

And it applies the
same way for a lot of these locative prepositions. So we have to worry
about the case as well. And my student
Jakob has annotated The Little Prince in German. And Korean, led by
Jena, is looking at many interesting
phenomena, where the labels we have for English
don't necessarily suffice, including
these pragmatic focus markers like "only,"
and "also," and "even." So you can say– which are expressed
post-positionally. So bread-to eat
means eat bread also. Bread-man eat means
eat only bread. And there's a bunch
of others as well, so this means we have to create
some sort of additional set of labels for this. I'm going to skip over this. We have some quantitative
preliminary numbers on how– on a very
small sample of aligned prepositions across languages. But we want to annotate more
to get a better sense here. But moving forward, we're
interested descriptively in what are the similarities
and differences in adpositions and case systems
across languages. Also, annotation
efficiency– can we make this very complex task
simple, or at least partial, you know, identify a simpler
part of it and crowd-source it? And finally, we
want to build models not only for English but
cross-linguistically, and apply them in applications.

Some of the things we're also looking at are second language English– the adposition use, and whether it differs and is influenced by the native language. We're looking at whether
we can generalize these labels to apply
to semantic roles that are not marked by prepositions. And a plug for a paper that my
student has coming up at CoNLL is looking at the integration–
so taking these labels and putting them into
graph-structured meaning representations, like UCCA,
where we have them annotated on the preposition, but they
can be incorporated really as role labels.

So please take a look at
that if you will be at CoNLL. And we have results on
different multitask sorts of architectures for
parsing with this combined representation. And it turns out, doing
it all jointly is better. OK, so to wrap up– I know I'm probably over time– I hope I've convinced you
that adpositions and case markers are interesting and
challenging and important for NLP in terms
of their semantics.

And we have a long
way to go if we want to cover all these
different languages and all these different facets. So thank you. [APPLAUSE] NOAH SMITH: We have
time for some questions. NATHAN SCHNEIDER: All
right, I saw your hand. AUDIENCE: How did you deal
with phrasal verbs in English? So like, you know, English
has quite a lot of these, where the sort of preposition
that tacks on to a verb is very, very idiomatic and,
like, leaches meaning as time goes on. So like hang up and look up,
beat down, like some of these are less opaque than others. But also, you have
this problem of they're not always right
next to the verb. Like you can say, look my
wife's doctor's dentist up sort of thing. And then how did you deal
with, like, this "up" really doesn't mean
anything at all, I guess? NATHAN SCHNEIDER:
So great question. I took out a slide
that addressed this. We had– we previously
did multiword expression annotation. So we already had marked
these idiomatic ones, like look blah-blah-blah up as
a gappy multiword expression. So in that case, we did not
annotate the preposition semantically because we
just treated it as a verb.

But there are sort of some
borderline cases where it's hard to know whether– so
if there was a spatial, like, I brought the book up,
meaning carried it up, then we would treat it
as a spatial preposition because it's being used
more productively there. AUDIENCE: Right. It just seems difficult
because a lot of the times, they start with spatial
meaning, and then sort of just lose that historically. And you might get, like,
border cases again. NATHAN SCHNEIDER: Yeah. Well, this is why multiword
expressions are so interesting. And if– the next
time I come here, maybe I'll give a
talk about that. I think Emily next? AUDIENCE: [INAUDIBLE]. NATHAN SCHNEIDER:
Oh, same question. OK. That's fine. AUDIENCE: Yeah, so I really
liked your motivations, like the cartoon pig. NATHAN SCHNEIDER: Yeah. AUDIENCE: And
computationally, there– a lot of these, like,
you have that example for cat in the kitchen
and in the afternoon. So is it possible that the
context itself tells you what the role of the preposition is? Or have you seen a difference? NATHAN SCHNEIDER: I mean, that's
the hypothesis of being able to train a sentence-level
disambiguation system– is that presumably
what it's learning is that if you see "in"
followed by a locationy thing, then it's a location reading.

And if it's followed by 12
o'clock, then it's time. But it turns out,
it's really– there– the factors there
are very complex. So it's difficult to
write a list of rules that capture all of that. AUDIENCE: Oh, yeah, sorry, I
didn't mean the [INAUDIBLE].. I was talking about
computationally, if you have a sentence
representation that just missed a preposition, like in– NATHAN SCHNEIDER: Oh, I
see what you're saying. So maybe you could have
a meaning representation that would figure
out it's a location without the preposition. Sure, I think the– there is– yeah,
there's always going to be this interplay between
the preposition itself and the other things that help
you figure out its meaning.

I think what's interesting is
that this helps us understand the sorts of things that
actually are ambiguous, and in what ways they
can be ambiguous, and how this might affect
translation of the preposition, and things like that. Yeah. Yeah. AUDIENCE: Do you have
an idea for if you were to actually deploy these
into sort of the main setting, like how much world
knowledge would be an obstacle towards being
able to actually [INAUDIBLE]?? NATHAN SCHNEIDER: How much world
knowledge would be an obstacle? I mean, we can come up
with examples like the– I think the example of
the DOJ subpoena example is really one that
requires world knowledge. I think probably
99% of the cases are pretty local semantic
information that is needed. But if you're looking
at, like, yeah, sometimes the preposition–
we also annotate things that are not
followed by a direct object. So I brought the book in. That doesn't have a
direct object overtly. So that's going to
be more challenging and require maybe more
context to know into where, and– or is that spatial,
or is it something else? Yeah.

AUDIENCE: So this is a
very exciting map here. It's just a little
scary to think about your current
process of having– going one language at a time. NATHAN SCHNEIDER: Yeah. AUDIENCE: You have to get
like a lot of grad students. [CHUCKLING] So do you think there's some
hope of this stuff projecting cross-lingually? Or– NATHAN SCHNEIDER: Yeah,
this is something– we want data with a
bunch of languages, so we can start experimenting
with, you know, algorithms to help us do this. AUDIENCE: But it sounds
like you haven't had to have too [INAUDIBLE] yet.

Is that just because
you're in the early stages, or because you're hoping it's
going to be quite universal? NATHAN SCHNEIDER: Also, for German, at least, the prepositions seem similar enough to English in how they're used that there probably are not going to be a whole lot of new semantic domains– now, individual prepositions are going to be different. But the semantic domains
Korean and Japanese, I showed an example from Korean. And there are a few
other things in Korean that are really not expressed
with English prepositions, or at least not as
regularly expressed with English prepositions.

So I think this
is just something we have to do for a number
of languages to see. Yeah. AUDIENCE: So we're trying to,
I guess, if one of our goals is to build systems
that understand language in a similar way
that humans do, I feel like with all
these borderline cases, there is some sort
of gap between, like, a system that
can just provide labels that are pretty good
for each preposition, and then a system that would do
the same thing that humans do, which is see these
borderline cases and become genuinely confused,
and start to think about all of the different implications. So there's this gap here that
exists between our systems that just try to label
examples, and the humans who are trying to reason about
what the labels should be. Like, and especially since
agreement is still, I guess, around that 80%,
like, I imagine, like, is there a roadmap? Like what do you think are the
steps to improve that agreement number and bring the systems
closer to what humans actually do? NATHAN SCHNEIDER: Yeah.

I mean, some of
the disagreement– I was involved with some
annotation this past summer. And some of the
disagreements are just really fuzzy boundaries
in the labels. There's– if you want
to represent, like, all the semantic relations
that are expressed in language, there's going to be a lot
of gray area that you cannot possibly tease apart perfectly. But we have found areas we
can improve the guidelines, like possession. We had very vague definitions for the, like, ownership sense versus the more general having-a-property sort of sense. So, yeah, I don't
necessarily have a great answer, other than
partially improving guidelines. It would be interesting to
have a data set of things that are really
ambiguous to humans and that cause
different readings. But in my experience
with annotation, it's less that there are wildly
different readings and more that humans are
not sure of the boundaries of the labels.

NOAH SMITH: OK, in
the interest of time, we should thank
our first speaker. [APPLAUSE] And switch over to
our second speaker. And for people who want
to make a graceful exit, this is your [INAUDIBLE]. [INAUDIBLE] OK, so while we do
the microphone switch, I'm delighted to introduce
Vivek Srikumar, who is an assistant professor
of computer science at the University of Utah. Vivek previously was a
post-doc at Stanford, working with Chris Manning,
and before that, did his PhD at UIUC with Dan Roth.

If my memory serves, you may be
the only person in the room who saw my first job talk. [CHUCKLING] Or maybe you're not that old. VIVEK SRIKUMAR: No. NOAH SMITH: OK. [CHUCKLING] VIVEK SRIKUMAR: I probably
missed it by a year. NOAH SMITH: You probably
missed it by a year, or you had something
better to do. AUDIENCE: So [INAUDIBLE]. NOAH SMITH: OK. OK, I'll start over. Apparently, the
people on the internet can't hear me right now. Sorry. VIVEK SRIKUMAR: [INAUDIBLE] NOAH SMITH: Yeah,
I'll do it again. And this time, you guys
can laugh even harder. I'll scare everybody. I'm waiting for the thumbs up.

OK, I'm delighted to introduce
Vivek Srikumar, in case you missed his name, who is an
assistant professor of computer science at the
University of Utah. Prior to that, he did
a post-doc at Stanford. And his PhD is from the
University of Illinois at Urbana-Champaign. Vivek has contributed
widely in NLP, focusing on machine learning. He's one of the people
who successfully rode the wave from linear
models to nonlinear models, with great contributions
on both sides of that ever-widening divide.

And we are delighted to
have him with us today. Today, he's going to talk
to us about what logic can teach to neural networks. So welcome. VIVEK SRIKUMAR: So thanks. I want to talk about
some recent work. So I see that Nathan has taken
about 10 minutes of my time. So I'll– NOAH SMITH: We started–
we also started late. VIVEK SRIKUMAR: So, no,
what I was going to say is we should probably
lock the door. [CHUCKLING] Because I was making my
slides, and at last count, it was like at 70 or something.

So anyway, I'm going to talk
about some recent work on logic and neural networks. This has been a hobby
of mine for a few years. This is joint work with my
students, Tao Li, Vivek Gupta, and Maitrey Mehta. So if you've been
following the news, neural networks can
understand text. I'm going to use this
task of textual entailment as an example through the talk. You're given– if
you've not seen this task before, you have
a premise, a sentence, and a hypothesis. And the goal is to decide
whether the premise entails the hypothesis, or it
contradicts the hypothesis, or none of the above. And if you take a
state-of-the-art system for this task– this is a
screenshot from the [INAUDIBLE] NLP demo– it tells you that
the premise entails the hypothesis, which is
nice, with a fair amount of confidence.

If you believe this, then
I have some news for you. The question that
I'm trying to– I've been trying
to think about is, is this just an
artifact of some sort of something magical happening
in the neural network? Or do these systems really
understand language? So here's a probe. Instead of two sentences,
let's take three. You have three
sentences, P, H, and Z. John is on a
train to Berlin. John is traveling to Berlin.

So P entails H. John is traveling to Berlin. John is having lunch in Berlin. Can't happen at the same
time, so H contradicts Z. But if you go to
the demo now, it will probably say that
P contradicts Z. So if John is on a– so, sorry, P and
Z are unrelated, which is a violation of some
sort of consistency invariant that we expect.

If the first sentence
entails the second, and the second one
contradicts the third, the first should
contradict the third. It turns out that this is not just one example. On a collection of hundreds of thousands of examples, we trained the model with BERT, and it gets like 90% on SNLI. It's near the top of the leaderboard, at least at the time we did it. About half of those
examples violate this sort of simple rule
that we can write down. So here's a rule in logic
that the system does not know.
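A minimal sketch of that consistency check (the label strings and the predict function are assumptions, standing in for whatever NLI model you query):

```python
ENTAIL, CONTRADICT, NEUTRAL = "entailment", "contradiction", "neutral"

def violates_transitivity(predict, p, h, z):
    """If P entails H and H contradicts Z, then P must contradict Z."""
    if predict(p, h) == ENTAIL and predict(h, z) == CONTRADICT:
        return predict(p, z) != CONTRADICT
    return False

# The triple from the talk; a model that calls P and Z unrelated violates the rule.
p = "John is on a train to Berlin."
h = "John is traveling to Berlin."
z = "John is having lunch in Berlin."
```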

So the question that
motivates this line of work is, how do we provide
such, let's call it advice, if you want, to neural networks? Here's another example
with reading comprehension. We have some text. And we have a question. And one approach for solving
this problem, one class of models, what
they do is they try to align words and
phrases in the question with the paragraph, and then
try to follow the point just to see what happens. So writing prose is related to
being an author of Latin prose. And long story short, the
answer is Julius Caesar. But this is nice, I mean,
the fact that [INAUDIBLE] can get this answer. But what if I would like to
inform the system that if you have two words that are
related to each other, then try to align them? Or don't rely on a
million labeled examples.

Here's a piece of information
that you can just use. If we have this information,
how do we inform this– how do we inform
models for this? So in this talk,
I'm going to talk about this idea of going from
learning from examples, which is the dominant way in which
neural networks are trained today, to learning with rules. Not learning from
rules, but with rules. And for that, we need
to think about how neural networks and logic
interact with each other. And all that will be
the bulk of the talk. And at the end, I'll talk
about some examples with– involving text
for understanding. So how do we go from
learning with– how do we learn with rules? The reason I'm interested
in this question is because I would like
to build neural networks without massive data sets. They are awesome if
you have a lot of data and a lot of compute. But what if I don't have data? Instead, I have an expert
who can write down some rules without too much effort.

So how do we
introduce– how do we integrate domain knowledge
with this sort of data set-driven learning? And the reason for this
is because, first of all, good data sets are really
not that easy to make. It takes some effort to make
a good data set, and years of work, and a lot of people. Thanks, Nathan. [CHUCKLES] The second– and so what I'm
getting at is we cannot– it is impossible for
us to have a data set for every task, and every
domain, and every language out there. So we need something else. The second observation
is that it's often easy to write down rules
about what we want models– how we want models to behave. So for example, in textual
entailment, I could say, if none of the content words in
the hypothesis have any strong alignment with anything
in the premise, the label cannot be
entail or contradict. It has to be unrelated. These kinds of statements are
often statements in logic. In fact, they
can– you can think of these as statements
in first-order logic that describe invariants about the problem, not the data set.
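Spelled out, that rule is roughly a universally quantified implication of this shape (a paraphrase, not a formula taken from the paper):

```latex
\forall P, H:\;
  \Big( \neg \exists\, w \in \mathrm{ContentWords}(H)\;\; \mathrm{Aligned}(w, P) \Big)
  \rightarrow \mathrm{Label}(P, H) = \mathrm{Neutral}
```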

And these are often, like I
said, universally quantified, and– but not necessarily. So what we need is the ability
to use these kinds of rules, and integrate it with data of
this kind, and build models. So for that, we need neural
networks to meet logic. So let's briefly look at
neural networks and logic. If you missed the
memo, neural networks
for differentiable compute.

So you can essentially
throw it at back propagation and wait till it gets
you a good model. In some sense, it provides
an easy interface to use. However, they're kind
of hard to supervise, because mostly, they
accept labeled examples. First-order logic,
on the other hand, does not involve calculus. It's hard to beat
it into TensorFlow. However, it's easy to state. I mean, people are, you know,
a domain expert can write down these kinds of things. And it's pretty expressive. So what we want is everything. What we would like
is differentiable learning that is
easy to use, namely with the best software
that's out there today, and all the cool
advances on embeddings, and representations, and all
that, and at the same time, use logic to guide the learning. So, of course, I mean, it's
not like this is the first time logic has made its way into– logic and constraints
and those things have made their
way into learning.

And the way I see it, there are
a few different places in which logic can get involved. It could somehow help
us design the models. And think, for example, the
neural network architecture is somehow informed by some sort
of a constraint that we have. It could influence how– it could influence what
loss function we minimize and how we train. It could also constrain
how predictions are made. We could constrain predictions
to be only those that satisfy some constraints. The last one is essentially
structured prediction.

It's fun. However, with neural
models, training can be a bit difficult
because logic is discrete. And the other problem is if
we go that way, every time we make a prediction
in the future, we need to have this sort
of constrained inference. So there's the
computational cost. Not saying it's not good,
because I do that also. But in this talk, I'm going
to talk about the first two, and talk about how we can
augment neural networks with logic, and how
we can essentially use logic-based design
of loss functions. This is stuff that was at
ACL and EMNLP this year.

I am going– I assume that I have
very little time. And I'm going way faster
than I had planned. So we can feel free
to interrupt and ask questions and those things. And then I can hurry at the end. All right, so there are
three different challenges that we have to address in
order to integrate logic with neural models. The first one is, how do we
bridge predicates and logic, or predicates that are inside
the rules that we write, with whatever is happening
inside the neural networks? The second one is this question
about differentiability. How do we make logic
differentiable? And the third one is,
suppose we did that. How do we use it? I'm going to introduce each
of these one at a time. So let's go to the first one. I would like to argue that
every neural network exposes some sort of an interface. You can think of
this as nodes that have meaning that is independent
of the neural network. So let me illustrate
this with an example.

So this is a dummy
neural network that takes a premise
and a hypothesis, and predicts the textual
entailment label. If you want, you can
think of the inside of this as some big bird
or something like that. You can think of
outputs as predicates. Here, this one,
this– the prediction that this is an entailment can
be seen as a predicate called entail, that has two arguments,
the premise and the hypothesis. You can think of
the neural network as assigning a probability
that that predicate is true. You can also write down
predicates about the inputs. The inputs are also
part of the interface that the neural network exposes.

And you can write down further
things like John is a person. And that's a predicate. But also, if you want to
open up the black box, maybe the neural network
has some sort of attention. So the cartoon here is that
words are encoded into vectors. These are these vertical lines. Both the premise
and the hypothesis are encoded into vectors. And then there's some
attention in the middle. And together, all
of these things somehow are blended together. And you get a label. This is a cartoon version of the
decomposable attention model, for example.

By design, we can think of these
attention edges as alignments. So the fact that there
is an edge between, say, traveling and train
can be written as the predicate that says
these two things are aligned. And we can think of the
attention as a probability that this alignment
should exist. So either way, I would like
to call these kinds of nodes in the network either the
attention, or the output, or maybe even these
non-existent nodes that are really
properties of the input, I would like to call
them named neurons.

In a computation
graph, these are nodes that have some
externally defined meaning. So if you want to think of this edge that connects these two phrases, notable author and known for writing, the neural network might have a node for that. Let's call that little a. And it's associated with
a predicate that says these two words are aligned. So this attention is,
in this case, aligned. Matter of convention, I'll
use lowercase for neurons and uppercase for predicates. Not every node in the
network needs to be named. There could be anonymous
units of compute that are just whatever back
propagation wants it to be.
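As a minimal sketch of the bookkeeping involved (the tensor names and shapes here are assumptions, not any particular model's API), a named neuron is just a place in the computation graph that you agree to read as the probability of a predicate:

```python
import torch

def predicate_probs(attention, logits, i, j, entail_index=0):
    """Read named neurons as predicate probabilities.

    attention: (premise_len, hypothesis_len) attention weights from the model
    logits:    unnormalized scores over the entailment labels
    """
    aligned_i_j = attention[i, j]                              # Pr[Aligned(p_i, h_j)]
    entail_p_h = torch.softmax(logits, dim=-1)[entail_index]   # Pr[Entail(P, H)]
    return aligned_i_j, entail_p_h
```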

But we are interested
in those nodes that we are able to
assign meaning for. And this could be a
controversial idea, because maybe we
don't want to do that. But I'm going to argue
that invariably, we have it either a priori by design,
like in this case, attention, or post hoc, we might
analyze a neural network and discover that a certain
node is the CAT neuron. And then we might want to
write constraints about that. So the named neurons give us the
vocabulary for writing rules. This will be the predicate–
these will be the predicates that– these will constitute
the predicates that we have for writing rules. And this will be the bridge
between neural networks and the logical
constraints that we have. So that answers
the first question. So we have neurons
that have some meaning. And we want to write
rules about that. The second one is actually
the slightly more tricky one.

And I realize now
that that's the one that I have the least amount
of material on but most amount of references for. What we need is to make
logic differentiable. Thankfully, this
has been the object of study for nearly
a century now, starting with the work
of Jan Lukasiewicz. And it has taken multiple
names in the past. I've been calling them– and I've been using this
notation from this book called triangular norms. You can call them t-norms. The idea is we want to
systematically relax logic into functions that are
hopefully continuous and maybe differentiable even. To do that, what we need
is to relax every operator that we have. There are only three of them. If you squint your eyes,
there might be four. We need to handle negations
conjunctions with junctions and implications. And the good news is
that if you like choice, there are many, many
such relaxations. I'll just walk through one
of them, the product t-norm.

If I have a predicate, capital A– assuming that true is 1 and false is 0– then when A is 1, 1 minus A is 0. And so 1 minus A is a good
representation for negation. And this is a relaxation
that applies even when A is not 0 or 1. In general, you should
think of t-norms here as a relaxation
for the conjunction. And once you get negation
and conjunctions, you get the rest of
the Boolean operators. And every single system here
has some interesting properties, and it's developed in
this axiomatic way. The important point here is that
Booleans, in the Boolean case, inputs and outputs live in– they can either be 0s or 1s.

In the relaxed version of
it, the inputs and outputs are in the range 0 to 1. And like I said, there are
different kinds of t-norms. So in the work that
I'm describing, we used a product in one case,
and Lukasiewicz in the other. There's been some work from Sebastian Riedel's group over the last couple of years that uses the Gödel t-norm because it has some
a lot of cool things you can do with this. And if you're
interested, you can just look up the references. Oh, another
interesting thing here is the Lukasiewicz logic
is basically built out of rectified linear units.
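As a sketch of what that means (only the operators mentioned in the talk; the function names and the assumption that inputs are tensors of truth values in [0, 1] are mine), the Lukasiewicz versions are literally clamps, i.e. rectified linear units:

```python
import torch

# Product t-norm: NOT(a) = 1 - a, AND(a, b) = a * b, OR(a, b) = a + b - a * b.
def product_and(a, b):
    return a * b

def product_not(a):
    return 1.0 - a

# Lukasiewicz operators, written with rectified linear units (clamping at 0).
def lukasiewicz_and(a, b):
    return torch.relu(a + b - 1.0)           # max(0, a + b - 1)

def lukasiewicz_or(a, b):
    return 1.0 - torch.relu(1.0 - a - b)     # min(1, a + b)

def lukasiewicz_implies(a, b):
    return 1.0 - torch.relu(a - b)           # min(1, 1 - a + b)
```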

So that's your connection
to neural networks. So you can construct
rectified linear units and plug them into
neural networks. And that's kind of nice. And it also has some other
really cool properties that we can talk about offline. All right, so what we
have here is a system of relaxations of logic. And now– oh, by the way, these
are only three of the t-norms. There are an infinite
number of them. Not all of them are continuous. Not all of them
are differentiable. We are going to just focus on the ones that are continuous and differentiable
because that's the world we know how to navigate. So if you want to make
logic differentiable, the answer is pick a
t-norm, and run with it. OK, what does it
mean to run with it? How do we use these things? So what can logic do for me? In general, I think constraints,
and logic, and rules, or whatever you want to call
them, introduce inductive bias. They make learning easier and
less reliant on training data. So in this case, we can
introduce inductive bias by either changing the
shape of the neural network, the architecture of itself,
so that the network ends up preferring parameters
that satisfy constraints.

Or we can use the logic to
define a regularizer that guides the model to regions
of the model space, parameter space, that satisfy constraints. And I'm going to go over
these one at a time. So the first one
showed up in ACL. And instead of going
through the gory details, I'm going to just give
you a cartoon example. So imagine that we have
a neural network here. And imagine there are
many, many nodes that we– that are not here on the slide. There is a layer
with all the As, and there's a layer
with all the Bs. And in between them,
there are many layers that don't concern us. B1, the node at the top, is
of special interest to us. Let's say the input to B1, the layer that's just before B1, is some x, whatever it might be. In the absence
of any other information, the neural network
does what it does. Let's say in addition
to this, we also have that one constraint that
says, if A1 and A2 are true, then B1 should be true.

Importantly, this rule says
that information from A1 and A2 can impact B1 independent
of what happens in these intervening layers. Of course, the network can
learn this information. They are expressive enough to
learn anything, apparently. So they can learn
this constraint. But then the question
is, why should you learn what we already know? So let me illustrate how we
can introduce this constraint into the neural network. First thing we can
do is we can take the left-hand side
of this constraint and write it down
using the Lukasiewicz conjunction. I am calling that D. D of A1 and A2 is essentially
a function that says, how true is the conjunction? For example, if A1
and A2 are both true, then this whole thing becomes 1.

If A1 and A2 are both
high, D becomes high. If both of them are low,
then D is turned off. And then we define
a constraint layer. We're going to
change the B1 node. We're going to define a
constraint layer that says there are two parts to the input to B1; let's call it B1 prime. The first part is the original input, the dot product of the weights and the previous layer. The second part is the node D, scaled by some rho. Think of D as a soft constraint. If D is high, what happens
is the preactivation value of this B1 prime gets
high, which means B1 becomes true, more true.

If D is 0, then A1
and A2 is false. And we can't say anything
about the value of B, because the constraint is
true irrespective of whether B is true or false. In that case, we let the
network do what it wants. If A1 and A2 is false, then D becomes 0, which means this term here becomes 0. So the network just
goes with what it has. And what I'm arguing for
is, essentially, replace– go to your neural network. Just replace all the
nodes that have B in them with something like this. You introduce a new node,
D, that is essentially the Lukasiewicz conjunction. And in addition to the
inputs that are really there, you also have a new H. This
is a systematic process that we can do without adding
any additional training parameters. The rho here is a
single hyperparameter that you essentially can
think of as how much do you trust your constraint? If rho is high, I really want
my constraint to be true, or I really believe
my constraint.
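As a rough illustration of that augmentation (this is my own sketch, not the exact code from the paper; the layer shapes and names are made up), the B1 node simply gets an extra additive term, rho times the Lukasiewicz conjunction of the named neurons A1 and A2:

    import torch
    import torch.nn as nn

    class ConstrainedB1(nn.Module):
        """Sketch of a constraint-augmented node implementing
        (A1 and A2) -> B1 as a soft preference: the pre-activation of B1
        becomes w.x + rho * D, where D is the Lukasiewicz conjunction of
        A1 and A2, and rho is a fixed hyperparameter saying how much we
        trust the constraint. No new trainable parameters are added."""

        def __init__(self, in_dim, rho=1.0):
            super().__init__()
            self.linear = nn.Linear(in_dim, 1)   # the original input to B1
            self.rho = rho

        def forward(self, x, a1, a2):
            # a1, a2: soft truth values in [0, 1] of the named neurons A1, A2
            d = torch.clamp(a1 + a2 - 1.0, min=0.0)       # Lukasiewicz AND
            preactivation = self.linear(x).squeeze(-1) + self.rho * d
            return torch.sigmoid(preactivation)           # soft truth of B1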

If rho is low, I am not
entirely sure about my constraint. So there are no additional
parameters that are required. And the nice thing is this is
essentially a compilation step. You create a computation graph in TensorFlow, or whatever, add these constraints, and just glue these extra nodes in there. And they can work. I mean, I'm talking about it, so they did work at
least twice, three times. All right, so that's one way
of changing a neural network, essentially. Change the architecture. Or that's one way
of introducing logic into a neural network,
that is, by restructuring the network so that it prefers the logic. Yeah. AUDIENCE: How do you deal with existential quantifiers [INAUDIBLE]? So all of these are
kind of universal.

VIVEK SRIKUMAR: Which ones? AUDIENCE: Like if you add this
rule, you are saying for– VIVEK SRIKUMAR: Right. So I don't know, actually. In this way of
doing things, there are some pretty
heavy constraints on what kinds of logical
formulae can be used. In particular, one interesting
thing is that whatever happens, we need to keep
this graph acyclic. So if you're not
careful about it, you could have some
constraint that says, A1 and A2 implies B1, and B1 and B2 implies A3, and some weird
cyclicity could show up. So this is a somewhat
restricted operation. And, frankly speaking,
I'm not saying that this is the end of it. I don't know. Yeah. Yes. AUDIENCE: Isn't it hard
to identify the nodes– VIVEK SRIKUMAR: Yes. AUDIENCE: –which might be– VIVEK SRIKUMAR: Right. So that's the question
of these named neurons. I'm assuming that A1, A2,
and B1 are named neurons. And that is really the
slightly tricky part. The assumption here is
that as the person who is designing the network,
maybe I have some intuition about what's going on. When we were working
on this, this work was done pre-BERT, or pre
BERT becoming popular. So you did not have all these attentions and alignments. And it was kind of
easy to think about it. And I've been thinking
for a few months now about how this
idea would work with these 12 layers of
transformers, or whatever, 20 layers of transformers. And I don't know. Yes, you're right.

This works if we have
these interfaces that are well defined. However, one
class of constraint that we did introduce, and
it turned out to be helpful, surprisingly, was just
connecting the inputs to the outputs. In a text-chunking
task, for example, we added a constraint that
says, if a word is a noun, then it should not be
in anything but an NP. You don't need to
understand what's going on inside the thing. My gut reaction was,
this is not going to help, because, you know, somehow the
thing is going to learn it.

It turned out it actually
helped in the [INAUDIBLE]. So it kind of helps. Yes. AUDIENCE: When defining
their named neurons, do you take into account
semantic similarity [INAUDIBLE] predicates? So is there a way of
capturing relationships between the predicates? VIVEK SRIKUMAR: I
haven't thought that far. Maybe yes. I didn't think about it. So I would like to
say yes, but that is without any
activity in my head. So I don't know. One could argue
that the constraints are capturing the
connective relationships between the predicates. And they take care of that. But that is a cop out. So let's say maybe. All right, so that's one way of
modifying this network, by just augmenting the network itself. The other way is to
acknowledge that it's hard to find named neurons that
are, you know, actually there.

And also, I realize
that our models today can essentially learn anything. They can also learn
the constraints. They can, you know, maybe layer
four in BERT is good at logic. I don't know. And let's just let the
learner figure it out. So the idea here is
to kind of change the process of learning
so that regularization– so that we design regularizers
using constraints. This is going to show
up at the [INAUDIBLE]. And to do this, we need sort
of a unified representation of both examples and
constraints in logic.

So one could see that a
labeled example is really a predicate about a particular
pair of– a particular input. So if I have a pair of
sentences, P and H, that are– that an annotator has
declared to be an entailment, I can write this down as
a predicate, entail P, H, which is a
statement that we would like models to preserve. But equivalently, I could
say true implies entail P, H. That's just a technicality. Constraints are also
naturally in this form. I can say that if P entails
H and H entails Z, then P should entail Z. So if
you have three sentences, the first one
entails the second, the second entails the third,
the first entails the third. These are universally
quantified statements. If we have statements
of this kind– everything here is of the
form left entails right. So you have, at the top,
you have true implies entail. And here, you have
some left-hand side implies the right-hand side. The thing at the top
is labeled examples. The thing at the
bottom, they are rules. Together, with
some mild technical details that we can talk about offline,
examples and constraints are essentially a
humongous conjunction.

This conjunction contains
the entire training set, but it also contains rules of
this kind, right? This gives us an idea
for a learning goal. This is a statement that
we would like to be true. In other words, among all
models that can exist, we prefer models that
set this conjunction to be true, because this
is essentially all my data and all my constraints together. Among all models, I want models
that prefer this to be true. But I can't search all
word models directly. So instead, I can just
literally apply the idea that we saw before,
take a t-norm relaxation of this conjunction. And I want that relaxation
to be as high as possible, because among all models, I want
models that can set this thing to be as true as possible.

That gives us a definition
or the beginning of what we are calling
an inconsistency loss, or a family of
inconsistency losses. Essentially, an
inconsistency loss is a constraint-driven loss
that is compiled from rules. If you have
universally quantified constraints, what you get
is actually a regularizer. If you have labeled
examples, we have something that penalizes
mistakes or penalizes models that make mistakes. In fact, a very
cool result that's actually
pretty easy to show is that if we use the product
t-norm with labeled examples, we get cross-entropy loss. So this is,
essentially, another way to invent the
cross-entropy loss. I tried connecting hinge
loss also to something else. But something didn't work out. Anyway, so the nice thing is
at this point, what we have is we just have a loss function
that we have to minimize.
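As a small illustration of what that loss can look like (a sketch under the product t-norm, with my own function names, including the cross-entropy connection mentioned just above), a labeled example relaxes to the probability of the gold label, and a universally quantified rule like the transitivity constraint relaxes to a penalty that can be evaluated on unlabeled triples:

    import torch

    def labeled_example_loss(log_probs, gold):
        # Product t-norm relaxation of "true -> label(x) = gold" is p(gold | x);
        # asking that relaxation to be as close to 1 as possible (via -log)
        # gives the usual cross-entropy / negative log-likelihood term.
        return -log_probs[gold]

    def transitivity_regularizer(logp_e_ph, logp_e_hz, logp_e_pz):
        # One possible relaxation (my assumption, in the spirit of the talk) of
        # "entail(P, H) and entail(H, Z) -> entail(P, Z)" under the product
        # t-norm: the penalty fires when the two premise probabilities are
        # high but the conclusion probability is low.
        return torch.clamp(logp_e_ph + logp_e_hz - logp_e_pz, min=0.0)

    # The full objective is just the cross-entropy on labeled data plus the
    # rule regularizer on unlabeled inputs: one more differentiable term that
    # any library and any optimizer can handle.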

This is just a
continuous function. It's a differentiable function. We can use any neural model, any library
and any optimizer. So all we have to
do is systematically compile statements of that
kind using a particular t-norm into a inconsistency loss. So this is all I'm going to
talk about in consistency– about the technical details. We can discuss this
offline if you want. In the 4 or so
minutes I have left, I'm going to rush through
a bunch of experiments. So the question really
is, do these things help neural networks? The papers have results
on reading comprehension and chunking. But I'm just going to focus
on natural language inference. The first case study
involves this idea of augmenting a model,
where we have a model. And we have modified it
using some constraints. We trained the decomposable
attention model on the SNLI data set. And there are two
constraints, both of which use the premise and
hypothesis words that we have, so P_i and H_j.

One constraint says, if
two words are related, they should be aligned: for any premise word P_i and hypothesis word H_j, related(P_i, H_j) implies aligned(P_i, H_j). The second constraint is the example that I had: if no content word in
then the label cannot be entailed, because there is
no sharing of information. These are the only two
constraints we have. They are both easy to write. And honestly, we didn't
spend too much time thinking about them. These are literally the first
things we thought about. So the question that I'm
interested in, in all the experiments, is, what is
the value of these constraints? I'm arguing that constraints
and rules and logic, whatever you want to call them,
are more helpful when we have less data, because
that's the situation where the neural networks
have to somehow get the signal from somewhere.
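Coming back to those two constraints, here is roughly how their left-hand sides could be written down as soft truth values. This is my own illustration with invented names: related is some external relatedness score and align is the model's soft alignment matrix; the augmentation described earlier would then scale these by rho and add them to the corresponding node.

    def rule1_lhs(related_ij):
        # "related(P_i, H_j) -> aligned(P_i, H_j)":
        # the left-hand side is just the soft relatedness of the word pair.
        return related_ij

    def rule2_lhs(align, content_word_indices):
        # "no content word in H is aligned to anything in P -> label is not entail":
        # soft truth of "H_j is unaligned" is 1 - max_i align[i][j]; here the
        # conjunction over content words is taken with the product t-norm,
        # which is a choice made only for this sketch.
        lhs = 1.0
        for j in content_word_indices:
            unaligned_j = 1.0 - max(align[i][j] for i in range(len(align)))
            lhs *= unaligned_j
        return lhs

    # In the constraint-augmented model, rho * rule1_lhs(...) would raise the
    # alignment score for a related pair, while rho * rule2_lhs(...) would
    # lower the pre-activation of the "entail" label node.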

So the experiments essentially
use different percentages of the training set. The first bar here is
the original model. The second bar, the red
one, is with the constraint. And a few interesting
observations– the first one is constraints often
help, almost always, especially in the
low data regimes. When the
training set is small, the improvements are
actually pretty high. Also interesting is when we
have a lot of training data, maybe you don't
need the constraints because the training data just
somehow gets this information.

So one result here is if
you have a huge data set, maybe just believe the data. If you don't have a
lot of data, maybe start thinking about these
sort of inference mechanisms. Another experiment involves
consistency of natural language inference. In this case, we trained
BERT-based models on the union of SNLI
and MultiNLI data sets. And there were two kinds
of constraints here. The first one we call symmetry, which says: you have two sentences; if the
first one contradicts the second, then the second one
should contradict the first.

Seems like a reasonable thing. And notice that in the
traditional world view, these are two
separate predictions. Nothing forces any model to
mark P and H as a contradiction, and H and P as a contradiction. It's possible that
this could be violated. And then we had four
transitivity constraints of the form that we saw before. If P entails H, and
H entails Z, then P should entail Z. All
of these are universally quantified, once again. In order to evaluate, we have
two kinds of evaluations. First is we can measure the
accuracy of these things. The second one is we can
ask, how often do trained models violate this kind of a constraint? So we can ask, what is the inconsistency with respect to some sort of a rule? I'm defining
the left-hand side of the rule is true, but the
right-hand side is not. So I have a rule that says L
implies R.

And if L is true, that's really this circle here. And if R is false,
that's this region here. That's that. Then we can measure the
fraction of those things. And that is a measure
of inconsistency. One thing that we found
that's interesting is no matter what we did, the
accuracy of the BERT models doesn't improve. They're pretty awesome.
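As a concrete version of that measurement (a sketch with a hypothetical predict function; I'm normalizing by the number of pairs where the antecedent fires, which I take to be the intent), the symmetry inconsistency is the fraction of pairs where the model calls (P, H) a contradiction but does not call (H, P) one:

    def symmetry_inconsistency(predict, pairs):
        """Fraction of examples where 'contradict(P, H) -> contradict(H, P)'
        has a true left-hand side but a false right-hand side.
        predict(p, h) is assumed to return 'entail', 'contradict', or 'neutral'."""
        fired, violated = 0, 0
        for p, h in pairs:
            if predict(p, h) == "contradict":        # left-hand side holds
                fired += 1
                if predict(h, p) != "contradict":    # right-hand side fails
                    violated += 1
        return violated / fired if fired else 0.0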

However, the models are
pretty lousy with respect to their consistency. So the symmetry, which is
a contradiction of A and B implies contradiction
of B and A is violated like 71% percent of
the times if we train a model on 1% of the data. And even if we throw
in 100% of examples, which is like, I don't know, I
think half a million examples or more, it's still pretty bad. It's like 60% of
the time, the model does not know that if one
sentence contradicts another, the other one should
contradict the first. Adding the regularizer
in all cases– that's this line here– helps. The transitivity
consistency is actually not violated that much. Comparatively, the numbers are smaller. But even that's surprisingly bad. I would've thought this is something the model should already know. And the good news, again, is
adding this regularizer helps. So two takeaway lessons. The first one is just
adding more training data does not help with evaluations
that focus on a broader behavior of the models, on
how a model interacts with even its own predictions, or maybe
with other models' predictions.

The second one is
these systematically introduced regularizers
which is nice to see. So in the minus 2 minutes I
have, let me start wrapping up. I, you know, meaning
representations are hard. And neural networks treat
meaning as boxes with circles in them, like this thing here. And I would like to point
out that maybe logic can be seen as the language that
connects representations of that kind, the
structured kind, with these vector
value representations.

And this could be a channel
for introducing information. I really like that quote
from Stanislaw Ulam, I think, which when I found that,
I was thinking, yeah, the whole thing is just bogus. Anyway, just to wrap up,
logic's a convenient language to convey information to models. It would be nice if
neural networks and logic can play nice with each other. And one way of doing
that is to take logics to the neural network
land by relaxing them. Once you get these relaxations,
we can systematically augment the model. Or we can modify how
the training happens to make models more consistent. Either way, I encourage you
to think about consistency of models and
self-consistency of models, because maybe just measuring
accuracy is misleading us.

All right, I'll stop now. Thanks, and I'll be
happy to take questions. [APPLAUSE] Yeah. AUDIENCE: For the
transitivity constraint, you need, like, three sentences. But I guess the data set,
the sentences come in pairs. VIVEK SRIKUMAR: Yes. AUDIENCE: So how did
you instantiate those? VIVEK SRIKUMAR:
So good question. So first of all, all
these constraints– I have two answers for that. The first one is that
all these constraints that are universally quantified
don't need labeled examples. I can just go out into the wild, because it's true for all triples. This is essentially a constraint that is true for any three sentences.

So it is universally
quantified, which means, in theory, I could go out there
and pick three random sentences and force this
constraint to be true. In practice, that's
not a great idea, because if you pick
three random sentences, chances are
everything is neutral. And so the constraint
is [INAUDIBLE]. So we went to, I think, [INAUDIBLE]. And we had multiple sentences for each image. And we picked three
of those, which are kind of close to each other. And one interesting
thing is that I strongly believe if you choose the
right kinds of sentences to apply this regularizer rule to, the impact could be higher. And that is connected to, for example, work from Sebastian Riedel's group, where they tried to find such examples. And using the Gödel t-norm gives a natural way of doing it.

But, yes, in theory, yes,
it could be for all triples. Yeah. AUDIENCE: For the symmetry
rule, are there any issues with discourse effects
or, like, the hypothesis contains, say, a pronoun that
doesn't get an interpretation when you switch the order? VIVEK SRIKUMAR: I don't
know if that happens often enough to be a problem. AUDIENCE: So this is
an issue, for example, in a [INAUDIBLE] data set. VIVEK SRIKUMAR:
Right, right, right. I'm not sure if it is an
issue in the SNLI type world. But in theory, yes, that's true. And, in fact, that's one of the
reasons when I was describing this constraint to
someone, they said, oh, but that's not going to be
true, because the way in which these sentences are
laid out might be– the ordering might be fixed. Yes, but SNLI is simple enough that it's not a problem.

Yeah. AUDIENCE: Yeah, so you mentioned
that, like, in the full data case, it looked like if
you were dealing with, at least in the numbers you
showed, like more than 10% of the data, incorporating
these constraints, like, decreased performance. But it seems like that implies
that, at least for the metric that you were
showing, like using the full data without
the constraints, like, the model increases
its performance in part by violating these constraints? VIVEK SRIKUMAR: Quite possibly.

These constraints are– in fact,
I had a slide here that said, these constraints are
really just suggestions. AUDIENCE: Yeah. VIVEK SRIKUMAR:
And it's possible that the model, maybe
the data violates these constraints, which is an
interesting thing in its own. AUDIENCE: Right. So– VIVEK SRIKUMAR:
Can I just add one other thing about that? The way this whole thing is set up, the model is still allowed to violate constraints.
possible that the model learns to violate the constraint
or ignore it because these are all soft constraints. So it's hard
to say whether the data is problematic, or maybe
the model has just decided to ignore
the constraints. AUDIENCE: Yeah. I guess for like
the two constraints that you listed, though,
like both of them seemed fairly reasonable.

VIVEK SRIKUMAR:
Oh, actually, no. I'm glad you think so. But– [CHUCKLING] AUDIENCE: OK. VIVEK SRIKUMAR: Yeah, so both
of them are fairly problematic. AUDIENCE: OK. VIVEK SRIKUMAR:
Yeah, the first one is this one, notion
of relatedness– how do we know whether
two words are related? We used [INAUDIBLE], which is neither sound nor complete. So that introduces noise. Here, following
the previous talk, it's not only content words
that carry information. So it's possible that this
is too strong a constraint.

So in both cases, you
know, it's possible that violations of these
constraints are OK. When we first came up
with these constraints, I thought exactly as you did,
that, yeah, you know, this seems like a necessary but
not sufficient kind of thing. But the resources are
noisy enough that it's neither necessary,
nor sufficient. [INAUDIBLE] AUDIENCE: Have you
thought of trying to train these on the old RTE data sets? VIVEK SRIKUMAR: On what? AUDIENCE: On the old RTE data
sets, the very small ones? VIVEK SRIKUMAR:
Yeah, good question.

I definitely thought of it. AUDIENCE: They're much
smaller than the 10% of the– VIVEK SRIKUMAR: Yeah,
in fact, interestingly, the old RTE data sets are roughly about as big as the 1%. And I think I evaluated
on it but did not train on it because we
had some trouble finding the training set for reasons
that are not technical. There was some methodological
issue or something; nothing conceptually impossible. AUDIENCE: [INAUDIBLE]
if there were no aligned words, then it
couldn't be entailed, or maybe [INAUDIBLE]? There's a subtlety between the
SNLI definition of entailment and the old definition
of entailment that you could have possibly two
sentences that are unrelated.

And they would still be
marked as contradicting, because they assume that
the two sentences must describe the same event. VIVEK SRIKUMAR: Right. AUDIENCE: While in
the old RTE days, this rule would be true,
I think, for most examples. VIVEK SRIKUMAR: That's right. So this is an interesting point. In fact, I think the fact
that you're thinking about that point makes me happy,
because in some sense, we want to be able to use
these kinds of signals directly and tell the models.

And in some sense,
what you just described is two different constraints
for two different definitions of the task. So, yes. I don't know who– AUDIENCE: I'm just curious how
adding these predicates has an impact on training time? VIVEK SRIKUMAR:
It didn't matter. Let's just go through both of these. For the first case, when we are augmenting the structure, nothing is
changing in the training time, because there are no new
parameters to be trained. The model is almost
the same size, so it doesn't matter so much. I mean, the network is
almost the same size because we are adding
very few extra nodes. Here, we are adding extra regularizers. That changes the training time,
because the extra regularizers are [INAUDIBLE] to
huge summations.

That increases
the training time. But it does not increase
it as much as, say, had I had inference during training. So it's still pretty reasonable. If I remember right, it
took a couple of hours on a GPU that's a
few years old for training the 1% model, for example. It was not bad enough that I had to think about it. Yes. Yeah. NOAH SMITH: OK, so I think
we're basically out of time. And Vivek and Nathan are
around, in town at least through the end of tomorrow. So send them a note
if you didn't get time on the schedule.

Maybe they would
hopefully [INAUDIBLE]. Thanks to both for a great talk. [APPLAUSE]
