
NW-NLP 2018: Ben Taskar Invited Talk; Learning and Reasoning about the World using Language


>> So, it's my pleasure to introduce Yejin Choi, who's an associate professor at the University of Washington and also a Senior Research Manager at AI2. She's done a number of really interesting research projects over the years. It's always fun to watch what Yejin has been doing, including things about common sense knowledge, language generation, and more recently AI for social good.

She has had a number of really impressive accolades, including being named to IEEE AI's 10 to Watch in 2015, receiving the Marr Prize at ICCV, and being a member of the team that won the Alexa Prize challenge in 2017. So, lots of amazing things to watch for. Well, that's probably enough. >> Okay. All right, so, speaking of
Alexa prize challenge, I will start by sharing
a brief experience with it. So, the goal of
this challenge was to create Conversational AI that can make a coherent and engaging
conversation with humans. The good news is that the team I was part of, thanks to the amazing students, won the competition, and it felt great to be the winner.

The bad news is that conversational AI remains more or less unsolved. Especially, to our surprise, or maybe it's not a surprising fact at all, the winning recipe was not based on neural networks. Brute force, more data, deeper networks, plus a bit of reinforcement learning, that just didn't work. So, it's curious why that would be, because we thought that neural networks do amazing things, like superhuman performance on object recognition or image captioning.

A lot of industry is replacing their long-standing statistical machine translation systems with neural network models. Then of course speech recognition has been working really well, and even more recently we've seen human-level performance on reading comprehension, and wow! All of these are based on very large amounts of training data and sufficiently deep networks, so this should work. However, if we look closely, there are significant performance gaps across different types of tasks. First of all, nobody's really reporting superhuman performance on making a conversation, or summarizing a document, or composing an email on my behalf, or identifying fake news, despite the fact that there is a lot of data for all of these. If not me, at least companies do have a lot of data for any of this, and they do have a lot of GPUs too. So, why is it that we don't hear about this in the news? Even for those applications for which neural networks do perform at superhuman level on some datasets, they usually are not very robust if given unfamiliar, out-of-domain, or adversarial examples, as Jonathan was talking about this morning.

So, what's going on here? I'm going to argue that in fact it's because there are fundamentally two different types of applications out there, and we're primarily seeing the advancements for type one tasks, where shallow understanding can get you quite far. An example is translating the English sentence "bananas are green" into a Polish sentence [inaudible]. For this, neural networks are very good at learning word-level or phrase-level pattern matching, so they can learn to translate from one language to the other without really understanding much about either language.

Whereas, if I were to make a conversation and tell you "bananas are green," you might tell me, "No, they're not," or you might tell me, "They're not ripe." If we try to learn the mapping function between input-output pairings this way, there is only a very weak alignment between the input and output, and in fact we need some sort of abstraction, cognition, or reasoning in order to really know what to say next. That in turn requires knowledge, especially common sense knowledge, and ignoring this factor altogether and trying to learn this mapping function just doesn't make a lot of sense. To make another point about how important it is to be able to read between the lines, let's think about the news headline "cheeseburger stabbing." If you give this to a state-of-the-art parser today, it's going to tell you that "cheeseburger" is a noun modifying another word.

The other word is "stabbing," which may be a noun or a verb depending on which parser you use, but that doesn't matter. The fact that one word modifies the other doesn't tell us whether it means someone stabbed a cheeseburger, or a cheeseburger stabbed someone, or a cheeseburger stabbed another cheeseburger, and so forth. You know what the actual meaning of this news headline may have been, because you're able to fill in the missing information from the sentence, so that there are probably two people involved in this act even though they were not even mentioned in the original news headline. When we do that, it's common sense knowledge that we rely on, especially physical common sense: it's not possible to stab somebody using a cheeseburger because it's too soft. And stabbing someone is not good, it's immoral, so it's more likely to be newsworthy, whereas if you stab a cheeseburger, who cares? So, when we look at and think about different types of knowledge, in fact there has been a tremendous amount of research before on learning and extracting encyclopedic knowledge, like who is the president of which country and born in what year, but not knowing this, I can still make a fairly okay conversation.

However, it's really
common sense knowledge that we need in order to
make an okay conversation. These, naive physics type knowledge and social norms, have not been studied very much, and if you look at the ACL Anthology, which as we all know is the repository of NLP papers, most of the papers that dare to mention the word common sense are either from the 80s or from the past few years. So, nothing in between, a complete void.

I think it's in part because there was a tremendous failure with common sense in the 80s, but recently, it seems, people are starting to forget about that past failure. So, there have been a few very interesting datasets, one of which, from Microsoft, is the ROC Stories common sense dataset. For all of these tasks, basically just brute force, more data, and larger networks don't really go very far, and we somehow need a different game plan.

To be honest, I don't really know what that should be, but for lack of any better idea I'm going to talk about two different directions: one, thinking about neural network architectures so they can better model the latent process in our mind when understanding and reasoning about text, abstracting away from the surface patterns; and then, simultaneously, we might want to think more about representation formalisms, orthogonal to neural networks, and really think about how we can possibly organize and learn common sense knowledge using language as scaffolding. Now, usually in this spectrum I talk about the three pieces shown here, but the second one will be described by Antoine today. So, I'm only going to give you a teaser for that and instead talk about dynamic entities in neural language models. So, let us begin
with the first part. So, the neural checklist model was originally designed to address a particular kind of language generation challenge. In particular, there are three different types of generation settings: case one is small input, small output, like machine translation; case two is big input, small output, like document summarization; and these two cases have been studied a lot more in the past.

Whereas case three is the case where there may be only very sketchy, small input and then suddenly we have to generate much bigger output. For this there are two unique challenges. First, there's an information gap: the machine has to have the creative power to fill in a lot of information that was not quite in the input. The second challenge is that the moment you start generating more than one sentence, it's very easy to create problems, so coherency becomes a big challenge as well. A little more formally, the task that we're going to think about momentarily has an input, which is a title, and an agenda, which is a list of items that you want to talk about. Then the output will be a paragraph or a document with multiple sentences that tries to achieve the goal implied by the title and tries to make use of only, and all of, the agenda items.

For now, we're going to use recipe generation as a running example, but later the model will be applied to dialogue response generation as well. At the beginning we thought that in order to generate a recipe, we should first parse recipes, because that's what a lot of people do in NLP. Once we parse them, we would abstract away some prototypical graph structure, from which perhaps we can sample a new graph, and then convert that graph down to raw text.

So, that was our original plan, and we even worked on the first segment and managed to write a paper about parsing recipes into action graphs. But then we realized that doing this entire pipeline is quite hard because errors start to propagate, and it was really unclear what to do about that. I still want to do this, but in the meanwhile we were wondering whether we could somehow simplify the whole process. That's when we heard about the encoder-decoder architecture in the neural network literature, which apparently works great for a lot of tasks like machine translation. So, let's encode the recipe title into a vector, then decode the entire recipe out and see what happens. Sausage sandwiches. This is what happens. I picked a worse example on purpose, but recurrent neural networks tend to repeat themselves a lot, like "sandwich sandwich."

Eventually it overcomes the sandwich overdose, but still things mostly get repeated, listed twice. So, why does this not work very well for us when it works really well for a lot of other tasks? Well, it doesn't work well when there's only a very weak alignment between the input and the output. That's what I was trying to tell you earlier about different input-output comparisons: in this particular task, only six to ten percent of the words in the output came from the input, which means 90 to 94 percent of the time the output word came out of almost nothing, or almost that way, for the neural encoder-decoder architecture. So, the rest has to come from somewhere else, and the off-the-shelf neural network models don't handle that very well. With this in mind, the high-level idea is: is it possible to compose a neural network architecture that could have this mental map of different items that you might want to check off as you go? That's what we did with the neural checklist model.

But since this one is a little bit hard to stare at, let's look at a much simplified version, as follows. We're going to make garlic tomato salsa. Based on this visualization, we encode the title into a vector. There's a checklist that has a list of agenda items, and then at each time step there is one unit of some recurrent neural network; in our case it's a GRU, but that particular choice doesn't matter. Internally, it's going to make a three-way decision: am I about to generate a non-ingredient word, a new ingredient word, or one of the already used ingredients?
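To make that three-way decision concrete, here is a minimal numpy sketch of one decoding step. It is only an illustration under my own assumptions: the parameter names (W_class, vocab_emb, agenda_vecs) and the exact way the checklist accumulates attention are stand-ins, not the paper's precise parameterization.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def checklist_step(h, agenda_vecs, used, W_class, vocab_emb):
    """One decoding step of a simplified neural-checklist decoder.

    h           : hidden state of the recurrent cell (a GRU here), shape (d,)
    agenda_vecs : embeddings of the agenda items (ingredients), shape (k, d)
    used        : soft checklist in [0, 1], one entry per agenda item, shape (k,)
    """
    # Three-way soft decision: plain LM word / new agenda item / already-used item.
    gate = softmax(W_class @ h)                       # shape (3,)

    # Attention over the UNUSED portion of the agenda (candidates for a new item)
    # and over the USED portion (candidates to refer back to).
    scores = agenda_vecs @ h
    attn_new = softmax(scores + np.log(1.0 - used + 1e-6))
    attn_used = softmax(scores + np.log(used + 1e-6))

    # Generic language-model distribution over the vocabulary.
    p_lm = softmax(vocab_emb @ h)

    # The checklist accumulates the attention mass placed on newly used items,
    # so it stays soft rather than black and white.
    used = np.clip(used + gate[1] * attn_new, 0.0, 1.0)

    # The final word distribution interpolates the three components; in the full
    # model the agenda attentions are mapped onto the actual ingredient words.
    return gate, p_lm, attn_new, attn_used, used
```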

So, let us see how it works. Let's just say so far we have generated "Chop the." Internally, the model is going to look at its own context and decide that it's time to use one of the new items. It's going to do a softmax over all the new items, and let's just say tomatoes got the highest score. So, we check that off the checklist and move forward. The second sentence might start with "Dice the." At this point the network might internally decide, oh, let's use another unused item, and do the softmax, the probability prediction, only over the unused portion; let's say onions were the next-best choice, so we check that off the checklist and move forward. Now, something interesting is about to happen. "Add to": at this point, contextually, it's the moment when you go back to one of the already introduced items; that's what the discourse cues are indicating to us, and that's also what the network picks up.

So, "Add to" what should we use between
tomatoes and onions? >>Tomatoes. >> Tomatoes. Because that's what was introduced earlier on, and the discourse convention
you said that that's what we do mention and
natural glands to do that. So, in fact all this checklist is probabilistic and
were black and white. So, it's based on the ten the cumulation
of attention scores, which reflects how much the model thinks that it made use of
a particular ingredient. Then the three-way
classification is also soft in the sense
it's an interpolation of the three different
language model like components where the first one is generic
language model like RNNs, or GRUs, or LSTMs.

Whereas the other two are based on attention mechanisms looking at either the used portion or the unused portion of the agenda items. So, we did automatic comparisons, and of course the story is that we do better than other baselines. We borrowed machine translation measures, and when you just look at it, this number looks really very low, so you probably want to see actual samples; I'm going to show those. But briefly, this one shows what fraction of agenda items the model is making use of, and the regular encoder-decoder architecture, with or without attention, practically ignores what's given in the input. When the input-output alignment is so weak, they're not very good at actually learning the mapping, so mostly the agenda items are ignored, whereas the checklist can make use of a much bigger portion. So, an example. Skillet chicken rice. Baseline, that's regular RNNs: In a large skillet, brown chicken in oil.

Add chicken and rice, cook. Add more rice and broth, then cook some more, and so forth. Stir in more rice, stir in more rice, keep cooking. It's probably edible, safe to eat in the end, but somehow there's not much coherency. The checklist, on the other hand, does a much better job. In a large skillet, heat rice and onion. Add carrots and mushrooms, cook. Stir constantly until hot and bubbly. Stir in seasonings, cook some more. Stir in chicken and then cook some more, and then serve over rice.

So, overall the coherency suddenly becomes much better, even though in this model we are not doing any fancy modeling of discourse in any way. It was only this checklist, that is, an external memory structure, that allows recurrent neural networks to better manage what's happening in the long-term context. And that's going to be a repeating message in the first half of this talk. So, it works really well when the input is familiar, and apparently skillet chicken rice is quite a common recipe in the dataset.

And this is a human recipe; it's probably a little bit better. So, let's look at a failure case. Chocolate covered potato chips. Now, I'd never heard of this before. The baseline: preheat oven, probably a good idea, but then grease and flour a pan and start baking right away. Because the input is unfamiliar, it doesn't know what to do with it, so it just ignores it. In the case of the checklist, it's paying more attention. It also does weird things, like suddenly baking something right away. But then it realizes, "Oh, I haven't done anything with chocolate, so let's melt it." Then it says silly things like, "Add the potato mixture to the potato mixture," which is grammatically correct, it's just a little bit silly. Eventually, it's going to fry that in hot oil, and I don't think it's a good idea to fry something covered in melted chocolate.

So, that says something about AI safety; I'm not joking here. The fundamental problem with neural networks today, despite all these amazing results, is that they struggle with unfamiliar input, and as a result, we cannot really trust them. The human recipe is much better. I'm going to come back to this point in a bit, but let me very briefly mention that the same model was able to achieve better performance on a particular dialogue response dataset as well. So, moving on to the teaser for the neural process networks, which I'm going to give you a motivation for. Here is another example. "Deep-fried cauliflower," that was the title. And then the neural checklist model generated the following. Wash and dry the cauliflower. Heat the oil in the skillet, and fry the sauce until golden brown.

Drain on paper towels, add the sauce to the sauce, and mix well. Serve hot or cold. What's wrong with this? Yeah. The cauliflower, it's still raw; it's clean, but it hasn't been cooked yet. I guess cauliflower is okay to eat raw, but still. So, when I gave this talk about the neural checklist model at Harvard University, a student asked me after the talk, are RNNs a mouth without a brain? I wasn't sure how to respond to that; I guess I agreed with her. So, we need common sense knowledge to reason about this sort of unfamiliar situation. Humans can still do the right thing, whereas we don't know what neural networks will do. That's the motivation: can we think about a new architecture that could read between the lines and reason about the unspoken but obvious effects? For example, "Fry tofu in the pan." Just common-sense-wise, we know that the location of the tofu must have been the pan, and the temperature of the tofu must become hot, even though the sentence doesn't say any of this.

In a way, it is similar to how we might do mental simulation about a lot of things when we understand stories, and understand others when making conversation with them. In fact, there is some literature in cognitive science and psychology suggesting that's what humans seem to be doing. Exactly how is always debatable, but we seem to be doing something like this. Can we then challenge ourselves by asking this question instead? A lot of NLP research has been based on labeling: let's label every word in a sentence with syntactic and semantic categories. That has been really very useful, but it focuses by and large on what is said in a sentence. Can we also think about, in addition to that, simulating the causal effects that are obvious but not spoken, and abstracting away from the surface strings? In fact, there have been some recent efforts of that flavor, but of course, as people write in papers at the beginning, no paper will solve anything completely.

So, we have another one coming up, the Neural Process Network. You can hear more about this in a bit today, but I will leave it there and talk about the next component, dynamic entities in neural language models. This was work with other colleagues at UW. You know, she was here; she left today. So, the problem with neural networks: this is another sample that I found in the paper that introduced the LAMBADA dataset, I think. Human says, "What's your job?" Machine says, "I'm a lawyer." Human says, "What do you do?" Now, machine says, "I'm a doctor too."

Somewhat understandable, if you think about how the "I" embedding here and the "I" embedding here, when they start as word embeddings, start out as the same word. So Jane in Jane Eyre maybe starts out the same as Jane in The Mentalist, in terms of their initial neural representations. Then, when I say "she" to mean Jane, suddenly they kind of look like different words. In addition to that, there's this philosophical question of, "Is it possible to encode very large context into one vector?" Ray said, "No, it's not possible." I don't know, maybe some of you in the audience are very young and have not heard about this, but he said that.

I'm going to just repeat what he said, adding that even a sequence of vectors is just not enough. A recurrent language model has basically a sequence of vectors, and our model adds external memory that can hold a better long-term representation, as follows. Let's just say these words were some entity, and then I'm going to create this separate memory chain, which is also recurrent over time. Whenever I see a new entity mention, I'm going to create a new chain, and the chains are just going to copy themselves over time until there's a coreference to the same blue entity, at which point I'm going to combine the context coming from the recurrent neural network with whatever I was storing here, in the separate memory chain, and so forth. That's the gist of this model: let's have a separate chain in which we store a little more precise information about different entity mentions corresponding to the same entity. In doing so, we now have a recurrent entity model in addition to the recurrent language model.
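Here is a rough sketch of that gist, with the gated interpolation described just below folded in. The class, the parameter names, and the specific gate are my own illustrative choices, not the paper's exact formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class EntityMemory:
    """Minimal sketch of a dynamic entity memory for a recurrent language model."""

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W_gate = rng.normal(scale=0.1, size=(dim, 2 * dim))
        self.memories = {}                      # entity id -> memory vector

    def new_entity(self, ent_id, h):
        # A fresh chain starts from the current LM hidden state.
        self.memories[ent_id] = h.copy()

    def mention(self, ent_id, h):
        # On a coreferent mention, gate between the stored memory and the
        # current context, and write the result back into the chain.
        m = self.memories[ent_id]
        g = sigmoid(self.W_gate @ np.concatenate([h, m]))   # element-wise gate
        self.memories[ent_id] = g * h + (1.0 - g) * m
        # The updated memory is what the next-word prediction can condition on,
        # in addition to the RNN hidden state itself.
        return self.memories[ent_id]
```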

Whenever we need to compute a new entity embedding in this chain, we interpolate between whatever was coming from the neural network language model part and whatever was stored here. That interpolation is based on gating, so that's a fairly standard thing to do. Gating can be parameterized in many different ways; this is one particular way of parameterizing it, but it doesn't have to be this way. So, once we have that, what can we do with it? Well, usually you sample words out sequentially in the language model part like this, and when you do that, instead of conditioning on just the hidden state of the recurrent neural network part, now we can also condition on whatever entity is relevant at that point. By doing so, we have more precise conditioning on the context that is more relevant, which would be impossible otherwise. In some sense, a feed-forward neural network with two layers should be able to approximate anything in theory, but nobody can train it to do anything useful.

That's why in the computer vision community we've been seeing a lot of new architectures like ResNet, DenseNet, CondenseNet, and so forth. Similarly, I think we need to do more research like this. This may not be the best architecture yet, but it's the kind of architecture that allows us to carry long-term context more precisely. So, that's sort of the intuition.

I'm going to give you just a very brief intuition about what we do for training and testing. We went with the idea of a generative model, such that for each word there are a bunch of bookkeeping variables that we create, such as a boolean variable that asks whether the word is part of an entity or not; if so, which entity it belongs to; and then the mention length, which decreases over time until it gets to zero. That's one way to do it; there may be different ways. Finally, coreference relations. This just shows how these variables might refer to different information. In this work, we'll just assume that we have a perfect coreference chain that was given from the sky.
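Just to fix the idea, the per-token bookkeeping variables might look something like the following; the field names are mine, not the paper's notation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TokenVars:
    is_entity: bool            # is this word part of an entity mention?
    entity_id: Optional[int]   # which entity it belongs to, if any
    remaining: int             # remaining mention length, counting down to zero
    antecedent: Optional[int]  # coreference link to an earlier entity, if any
```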

Assuming a gold coreference chain is probably unrealistic, but as a first step that's what we did. During testing, of course, it seems a little too optimistic to assume that, so we just sample the latent variables and marginalize them out using importance sampling. By doing this, what can we do? We can think about different use cases. First of all, we looked at just using this as a language model that can marginalize out different coreference decisions, and in that case it can lower the perplexity on the CoNLL test dataset. In the second case, we also used this to improve a coreference resolution system that we could piggyback on, by adding our model scores as additional scores to help with the reranking. So, we plugged our system into a reranking-based system and we were able to improve the performance. It's not state of the art, because people keep improving on this quite a bit, and our colleagues at UW have much stronger results now.

Then finally, there was a very interesting new dataset that was a little bit common sense inspired, about predicting which entity may get mentioned next. For that task as well, we were able to perform much better than other strong baselines. So, that's another story that repeats this theme, or spirit, of having a new neural network architecture that might better represent the latent process that goes into the way text is written or understood. Now I'm going to switch gears quite a bit and talk about VerbPhysics, which is more on this side. This work is motivated by the hope that in the future we may have a home robot that should understand a lot about different household items so that it can interact with them. It should have an understanding of relative size, weight, rigidness, strength, and so forth. It's good to be aware that usually people are larger than a chair, and if I were to look this up on Google, nobody says this. Well, now I said this, so you can search for it, but when I searched for the first time, I couldn't find that or any of this, because nobody says any of this.

It's trivially true, so this is known as reporting bias: people don't state the obvious, and when people finally do say something, it's the exceptional case. So, we must not conclude that horses are similar in size to dogs. Then what do we do? I thought the answer must be in computer vision. Let's look at images. So we worked on this, and one of the papers looked at lots of images and estimated relative size differences, adjusted for depth differences, and we were able to learn some knowledge, like dogs are usually bigger than cats. But the takeaway from this work was that I thought NLP was very hard, but computer vision is also hard.

In addition to that, computer vision just takes more computation because images are bigger to process. So, after processing lots of images, I only got something like hundreds of knowledge nuggets, which seemed kind of small. And in addition to that, size we can measure, but what about relative weight or strength differences or speed differences? These are visible to human eyes, but not really to computer vision yet; to some degree yes, but not in a reliable way that I could depend on. So, our revised plan is that, well, one could just wait, but since we don't have anything better to do, let's go back to language but with a different plan. The key insight is this: even though nobody says any of these things, people do say that they throw a pen, or a stone, or a chair, and so forth.

All of these are possible when the agent is typically bigger than the object and heavier than the object, and as a result the object is temporarily moving faster than the agent. So, the representation, which we named VerbPhysics, goes as follows. It's very similar to frame semantics, but it talks a little bit more about pragmatic meanings of language. For any pair of potential arguments of a predicate, we can think about what may be implied. So, "I walked into the house" may imply that I am probably smaller than my house, lighter than my house, and moving faster than my house; and if I squash a bug with my boot, probably the bug is smaller than my boot, lighter than my boot, and so forth. So, there is a sort of long list of obvious, likely truths about the world that comes to our mind, and perhaps we can learn to detect those as well.

In terms of the model, we decided that perhaps we can solve two related problems simultaneously: in particular, physical properties implied by verbs, so what different actions imply about their arguments, and also, in general, relative knowledge between objects. This is across five different attributes in this work, although one can extend it to include other attributes as well. And if you've been waiting for the deep learning part, in this work we didn't use deep learning per se, although we did make use of neural embeddings. Otherwise, it's a factor graph with lots of variables that have a probabilistic interpretation. It goes something like this; I'm going to give you a very sketchy idea of what we're doing.

We're going to throw in a bunch of random variables, where each of them might mean something like: p is either bigger than, smaller than, or comparable to q in terms of size. So, they can have these three different value assignments. On the other side of the graph, we can also throw in a bunch of random variables that encode action implications; again, the value can be either bigger, smaller, or same, and they can be about a particular predicate like "throw" with respect to a particular attribute like size.

In fact, this predicate variable is internally a collection of random variables, due to the fact that the same verb, like "throw," can be used in many different frames, so we just use different random variables for different instantiations of any particular predicate. But for now, we might just assume that they behave similarly. Now, at this point, there's no evidence about any of these random variables' truth assignments, because nobody says "my house is bigger than me," and nobody says "when I throw something I am usually bigger than the object"; no one says any of this. So, the only thing that we do observe from language is how people use different arguments together with different actions, for example, "I threw my bag as I walked into my house."

People do say this, and that relational evidence is the only evidence that we have. Translating this intuition into the graphical model, we throw in potential functions that quantify the selectional preference between a verb and its arguments, and then similarly a lot of potential functions for object-object similarity, as well as verb-verb similarity and frame-frame similarity. Finally, we cannot suddenly decode knowledge out of nothing, so there's a little bit of seed knowledge given as unary potential functions, so that we can reason about the entire network. So far, we have only been talking about size, but we can keep doing this until the network gets very, very messy, by adding the weight part and other parts as well.

There's a well-known inference algorithm known as loopy belief propagation; that's what we used for inference. The conclusion, very briefly, is that a random assignment would give one-third accuracy, and the majority baseline is about 44 to 50. Whereas if we just use neural embeddings to reason about different knowledge without the entire graph, just one node independently of all the others, then that's how well we do, and having the graph inference helps improve the performance further.
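To make this concrete, here is a toy sketch of sum-product loopy belief propagation on a pairwise version of such a graph. The variable names, the seed potential for "throw," and the agreement potential below are made-up illustrations of the idea, not the actual VerbPhysics factors.

```python
import numpy as np

STATES = ["bigger", "smaller", "similar"]   # 3 possible values per variable

def loopy_bp(unary, edges, pairwise, iters=20):
    """Sum-product loopy BP on a pairwise MRF.
    unary    : {var: np.array(3,)} unary (seed-knowledge) potentials
    edges    : list of (u, v) variable pairs
    pairwise : {(u, v): np.array((3, 3))} potential psi[x_u, x_v]
    """
    msgs = {}
    neighbors = {v: [] for v in unary}
    for u, v in edges:
        msgs[(u, v)] = np.ones(3)
        msgs[(v, u)] = np.ones(3)
        neighbors[u].append(v)
        neighbors[v].append(u)

    for _ in range(iters):
        new_msgs = {}
        for (i, j) in msgs:
            # Orient psi so rows index x_i and columns index x_j.
            psi = pairwise[(i, j)] if (i, j) in pairwise else pairwise[(j, i)].T
            incoming = unary[i].copy()
            for k in neighbors[i]:
                if k != j:
                    incoming = incoming * msgs[(k, i)]
            m = psi.T @ incoming            # marginalize over x_i
            new_msgs[(i, j)] = m / m.sum()
        msgs = new_msgs

    beliefs = {}
    for v in unary:
        b = unary[v].copy()
        for k in neighbors[v]:
            b = b * msgs[(k, v)]
        beliefs[v] = b / b.sum()
    return beliefs

# Toy instance: a seed belief that "throw" implies the agent is usually bigger,
# connected by an agreement potential to the unknown pair (person, chair),
# because people are observed throwing chairs.
unary = {
    "throw_size": np.array([0.7, 0.15, 0.15]),   # seed knowledge
    "person_vs_chair": np.ones(3) / 3.0,         # no direct evidence
}
edges = [("throw_size", "person_vs_chair")]
pairwise = {("throw_size", "person_vs_chair"): np.full((3, 3), 0.1) + 0.8 * np.eye(3)}

print(loopy_bp(unary, edges, pairwise)["person_vs_chair"])  # mass shifts toward "bigger"
```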

On one side we are looking at frame predictions, on the other side we are looking at object-object knowledge. To summarize, we have explored whether it's possible to reverse engineer naive physics knowledge from language, even though people don't speak the obvious facts. Perhaps we can design a model or inference algorithm that can still reason about what is unspoken but what we all assume of each other, which systematically influences the way people use language, which then gives us some clue about what may be true about the world. We can do this without having robots that have an embodiment and interact with the world. All right, now I'm going to conclude with some remarks about what future research directions we might pursue further.

As far as neural network architectures, there have been some debates about whether there should be innate architecture or not. I'm obviously on the side of having innate architecture. I think in the NLP community there has been, relatively speaking, more emphasis on linguistic sentential structure, where there usually is some explicit structure and it's very, very well studied. Whereas the moment we start thinking about the document level, or discourse level, or just generally long-term context or latent processes of how the world works, it becomes much less clear how to handle this.

Nonetheless, I think this part is very important to proceed further. Then, as for formalisms, I think we should really start thinking more about what to do with common sense. I was told not to use the word for some time, and I think it's because it was a major failure before, but I realized at some point that it's just nonsense to conclude that that direction is going to be forever impossible, when the past failures happened with weak computing power, not much data, no crowdsourcing, and no strong computational models, and it was also done by non-NLP people. I think with language we can do a better job, or we should at least try before concluding that it's not going to work.

In terms of physical common sense, the potential impact cases could be more zero-shot or few-shot reasoning cases, especially in language and vision and robotics applications. But we can also do similar things with social common sense, which I haven't talked about today. However, we have been working on something along this line with connotation frames, and we also now have a new ACL paper coming up, which I'm going to give you a teaser about. It's based on the ROC Stories developed at Microsoft and Rochester, where a story goes like this: "The band instructor told the band to start playing. He often stopped the music when players were off-tone. They grew tired and started playing worse after a while," and so forth. When there are these sorts of people involved in a story, we can oftentimes reason about what the people might feel.

The instructor was furious and threw his chair. It doesn't say what these guys feel now about their situation, but we can imagine that they must feel fear, or feel angry or sad, even though it's not mentioned here. So, we can have this very low-level annotation about what people might feel and what people's motivations might be before and after different actions and events. We have a bunch of new annotations and baseline models coming up soon. Related work, also at ACL, will be about common sense inference. For example, if someone cooks Thanksgiving dinner, probably that person's intent is to impress their family. Afterwards, maybe they will feel tired but still feel a sense of belonging. And finally, other people, even though they are not mentioned in the sentence, will probably feel impressed or happy, and so forth.

So, we can reason about what people might typically do before and after, in terms of their mental states, and this is also coming up at ACL with a bunch of new datasets. Finally, I'd like to say that this new modeling power creates some new opportunities for doing more human-centric applications, so that people don't just fear AI as a negative, potentially scary thing. I think we could try doing more positive things in this space. Personally, I think we should work more on interactive AI, because that's what the robotics people really want as well.

Especially with neural networks, it seems that we can do a lot of creative things in text generation too. So, I will just read this sonnet that was generated by a machine. "Turing" was the title, and it goes like this: "Creating some
electric slot machine, of hearing those
familiar voices say, 'Awake behind
an empty picture mean, the windows open over Alan Kay.' Consider me of
regular expression, or maybe something
very quickly typed. And I forget about
on old obsession, a hundred thousand people
getting hyped. Become the biggest part
of my computer, and take another journey
down the road. Or set a balance on
a minor scooter, or even matter whose enigma code. And music takes control
of all the means, surrounded by a world
of a strange machines." Couldn't fool anybody yet. Thank you. >> Thanks very much for
an awesome talk, Yejin. We have about five minutes
for questions. So, I'll run the mic. >> I have a question; I wanted to know your thoughts on something.

So, back in the day, there was work done on OpenCyc, or on the Cyc knowledge base, which was encoding common sense knowledge in knowledge graphs. I was wondering about your thoughts on the utility of using that, possibly for distant learning. >> So, I think there have been a couple of archival papers that tried to use at least ConceptNet, which maybe is analogous to OpenCyc, and tried to see whether it can improve some downstream applications like QA and entailment and dialogue. So, it's a great question. Personally, I didn't look at Cyc, but I heard from other people who did look at it that it's very much is-a hierarchies, so it's very dictionary knowledge or taxonomy knowledge. ConceptNet, which I did look at myself, is also predominantly taxonomic knowledge, though they do have a lot of other relations that I really appreciate and find very interesting.

Unfortunately, coverage of those more interesting cases is a little bit lacking, which is sort of what motivates our new dataset that tries to cover them. Motivation, for example, is included in the ConceptNet relation definitions, but the coverage is relatively small, so this is one way to increase the coverage. Eventually, it may be useful, but we'll have to find out. >> With supervised learning using human-annotated data, is the hope that you can use this, with the correct architecture, to help the system in general understand how to make these types of inferences?

>> Yes. So, what we do show in the paper is that by using a fairly basic encoder-decoder architecture, basically almost that, we can learn to encode any textual description of an event and then reason about what might happen to people's mental states, such that given a previously unseen event description in the test dataset, it can still reason about the likely mental states, which is what we basically have to be able to do. If we were using old-fashioned, logic-based systems, that wouldn't work unless the knowledge had been populated for everything, whereas with a neural network we can learn to compose the likely anticipation instead of having to crowdsource literally all of it, which I think is just not possible.

>> Maybe we have time for
one more quick question before the end of this session. >> So, how do you know when you're done? As in, how do you measure coverage, or recall? You can measure precision, but what about recall? >> So, in this particular work, we have a sort of test set, where the cases in the test set start with the left-hand side of the logic, and then we try to predict the right-hand side.

But I wouldn't be able to say, in terms of recall, with respect to everything that could ever be known. I wouldn't know yet. Maybe if we start thinking about it, someone will come up with a number. >> Thanks once again, Yejin.
