Fireside Chat with Christopher Manning


Welcome, Chris. It's my pleasure to welcome you to Microsoft Research, where you'll be giving a Distinguished AI Seminar later today. For the audience: Chris is the Siebel Professor in Machine Learning at Stanford University, where he has appointments in both linguistics and computer science. You head the Stanford AI Lab, and you've recently become part of the Stanford Human-Centered AI initiative, the HAI effort. You wear many hats, and one of the things I've loved about your career is its focus on statistical NLP, ranging from work on parsing to grammar induction to many applications in information retrieval and machine translation. You're amazingly well respected and successful: you have more than a hundred thousand citations, I think, two very popular textbooks, and you're a fellow of the ACM, AAAI, and the ACL, is that right? So your breadth shows in the venues you publish in and the many awards you've received. I want to start by getting a little insight into how you got started in this area. You did your undergraduate degree at ANU, the Australian National University, and if your web page is correct, you majored in linguistics, computer science, and math. Can you say a little about how you combined that computational perspective on language, seemingly from a very early age?

Sure, yes. From an early age, on the one hand I liked doing things with computers, but almost by chance I also saw a little bit of linguistics, and the structure of human languages seemed a really neat thing. So as an undergrad, right from the beginning, I did this combination of linguistics, computer science, and math. The funny detail is that I actually did all of this in an arts degree, because the way things were at ANU, they didn't have engineering, and if you did science you had to do physics and chemistry, which didn't appeal to me. So I did my math and computer science in arts, which was very flexible.

I have a similar background in math and cognitive psychology, and again it was partly because I was fascinated that you could use computational techniques to describe things like memory and learning. Was there a particular course or project you worked on that piqued your interest in language?

Yes. Really, my thinking that linguistics was a cool thing came from a chance encounter in high school with an intro linguistics text. It introduced the International Phonetic Alphabet, which you could use to describe the pronunciation of any language in the world, and then the idea of phonemes: that there is a distinctive set of sounds in a language, so that even though there is finer-grained variation in how people pronounce things, there are distinctions that count as the set of sounds of a language. Both against the background of all the irregularity of English spelling that you struggle with as a young student, and just from seeing this scientific approach to the study of language, that somehow struck me as really neat. So in my very first semester at ANU I took the intro linguistics class.

OK, so my path was a little different: I started out in math and only later discovered amazing applications. I want to talk a little about the first of your books, on the foundations of statistical NLP. It's probably the first place I read a fair amount of your work. This was in 1999, I believe, with Hinrich Schütze. I really liked the fact that it focused on corpus-based work, which I think was then at the beginning of its course in NLP. It also had a section I really liked, on lexical acquisition; this was around the same time we did the work on LSI. Those were foundational chapters in the book, and I
also liked, of course, the applications to machine translation and information retrieval, so there was a lot of overlap between that work, which I have right here, and some of my own work at the time. Why did you write the book? What was missing in the textbooks, or what did you want to get your head around as you wrote it?

Sure. What was missing is easy: there was just about nothing of that sort. Both Hinrich and I had come from the then-traditional background in natural language processing and linguistics, where there were handwritten, rule-based descriptions of lexicons and grammars, parsing was done in terms of those grammars, and there were some formal-semantics ideas about meaning representation. And then we'd started to see this other world. There were early explorations, once text became available digitally in large quantities, of seeing empirically what you could do with large amounts of text. Because at that stage having the requisite computing was still a challenge, a lot of it effectively happened at big research labs: IBM Research and AT&T Research were the two most prominent places. But we were both on the west coast at that time, and there was also a small group at Xerox PARC exploring that approach, and I guess that's how we got involved at first. We were both excited by those new ideas. As for directly writing the book, for me the motivation was essentially that after I graduated with my PhD, my first job was at Carnegie Mellon University, and because I had had a little experience doing statistical NLP work, though really not very much, they asked me to teach a course on it, since that was now what the students wanted, reflecting some of these recent developments, and there wasn't any such course. So I started teaching a course, and that led fairly obviously to the textbook.

A bit before us there had been one short text that Eugene Charniak had written on statistical language learning, but we felt its coverage was fairly limited: it covered some of the core algorithms, HMMs and PCFGs for tagging and parsing, but just about nothing outside of that. So we wanted to write a broader book about empirical approaches to NLP that included things like lexical semantics and how to do statistical NLP work. Effectively, at the time we were looking to write a book that complemented the existing NLP texts. At that point there was James Allen's text on natural language processing, and there was another one, I think by Chris Mellish and a co-author, on natural language processing in either Prolog or Lisp. So at the time we really saw ourselves as writing the complementary book that represented the other half of the picture, the new work being developed. Of course, what happened fairly quickly is that statistical approaches to NLP took over the whole field, and it started to become the book people used rather than a complementary book.

Right. You've talked a little about NLP at the beginning being maybe a sequential box model, with lexicon, syntax, parsing, and semantics. Did this book start bridging some of those, or blurring the lines between those formally distinct processing stages in language understanding?

I think a lot of what was being done in statistical NLP was that people were taking those boxes and doing in them what could be done. So you were doing lexical semantics, you were determining parts of speech, you were trying to get a syntactic structure for a sentence. It hadn't really become anything that was unified as a whole.

OK. To me, one of the big insights in the book was that you could do a lot from the analysis of
large-scale textual corpora, which were really just becoming available. And fast-forwarding twenty years, we'll talk more in a minute about the deep learning revolution.

Yes, I think that was a real eye-opener. It was what initially got Hinrich and me, but really lots of people, so excited when these statistical, corpus-based techniques started to be explored on large amounts of digital text. We had been brought up on handwritten rules and lexicons, and on the idea that language understanding was impossibly hard, an AI-complete task, and in a way that's true. But there was this very appealing discovery that if you took millions of words of text (in those days we were only doing millions; we hadn't yet got on to billions) and you just did really simple statistical stuff, I mean effectively you just counted things, you could get all of this really interesting information. You could build simple statistical models to do things like predict word senses, or work out what arguments different verbs take. It was like an appealing drug: there was this new way in which you could start learning knowledge of language just by doing fairly simple operations over bodies of text.

Right. When you think about acquisition of vocabulary or syntax in kids, a lot of it is inductive. Certainly, at least when I was growing up, we all diagrammed sentences, but you also learn a lot by listening and talking, and that is very much the same sort of thing you can do in corpus linguistics. A few years later you worked with Daniel Ramage on some work that I found very interesting. It first looked at topic modeling, applied to very interesting questions about the scientific literature, but at that point you also started combining more structural or semantic information, in the work on supervised topic models. Can you talk a little bit
about the blending of those two perspectives? There's maybe the bottom-up, inductive perspective, and then there's knowing something about the structure of language, or at least of the particular task.

Yeah. The context of this was that at that point I was involved in using NLP for a social science project; in particular, we were looking at the development of multidisciplinary science. In some sense that's not a direction I've kept on with. Some people have continued to be very involved in social science applications of NLP, whereas I did it for a period of four or five years and then moved back into more core NLP areas. But as soon as you're in the context of an application, you want something that actually works and does something useful, rather than something that is just mathematically nice. Preceding that work, there had been work by David Blei and Michael Jordan on developing latent Dirichlet allocation, which was an unsupervised method of topic modeling. It had captured a lot of attention because you could work in an unsupervised manner, take a lot of text, and it would give you a kind of multi-clustering based on the words in the text, which seemed a useful inductive way to get a sense of what a text collection contained. But the problem was that, like a lot of clustering methods, you got clusters of words and concepts, but they had no labels, they didn't necessarily correspond to how humans conceptualize the world, and there was very little way to control what came out, apart from reinitializing and running it again and seeing if you liked the result better than the previous one. What we wanted was to provide a tool that was more connected to human conceptualization and more controllable. So we came up with this almost paradoxical idea, which we called labeled latent Dirichlet allocation: taking this unsupervised method and making it semi-supervised. The idea was that in many human domains there's an existing classification. The first thing we worked on was across the sciences, and academia has departments, and fields have journals, and both give a classification of different areas and topics. You can think of many other areas, such as the biomedical sciences, or industry groupings, where these classifications exist. Our idea was that we could use documents that have already been assigned classes to say: here is a document, and it will allow you to learn about this class. You do that for thousands of documents, and then you can run the model on new documents and ask, OK, look at the language of this document: which parts of it are reflective of a particular topic, where our topics now have names? So we can look at computer science documents and say which of them are talking about biology, or talking about linguistics, or using information theory methods, by looking for those topics' appearance in a document, rather than running it as an unconstrained latent Dirichlet allocation problem. So really it was in some ways combining the best of labeled classification with clustering. It put some structure on the problem and then took advantage of the broader statistics. In some ways it was a simple idea, but it's actually proved very successful, because I think there are just a lot of people in precisely that situation: if they just ran an LDA, what came out was intriguing, but you got these unconstrained, unlabeled clusters, when actually there was already some kind of human classification of their world, and this gave them a better tool for working with their documents and understanding relationships in their domain.

Very interesting.
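The labeled LDA idea described above can be sketched as a tiny collapsed Gibbs sampler: exactly like LDA, except that each token may only be assigned a topic drawn from its document's human-given label set, so every learned topic keeps a name. The corpus, vocabulary, and labels below are invented for illustration; this is a sketch of the idea, not the implementation from the paper.

```python
import numpy as np

# Toy sketch of Labeled LDA: collapsed Gibbs sampling where each token's
# topic is restricted to the labels observed on its document.
rng = np.random.default_rng(0)
vocab = ["gene", "cell", "protein", "syntax", "verb", "noun"]
topics = ["biology", "linguistics"]            # topic index == label index
docs = [                                       # (word ids, allowed label ids)
    ([0, 1, 2, 1], [0]),
    ([3, 4, 5, 4], [1]),
    ([0, 2, 3, 5], [0, 1]),                    # a multi-label document
]

V, K, alpha, beta = len(vocab), len(topics), 0.5, 0.1
n_kw = np.zeros((K, V))                        # topic-word counts
n_dk = np.zeros((len(docs), K))                # document-topic counts
z = []                                         # current topic of each token
for d, (words, labels) in enumerate(docs):
    zd = [int(rng.choice(labels)) for _ in words]  # init from allowed labels only
    for w, k in zip(words, zd):
        n_kw[k, w] += 1
        n_dk[d, k] += 1
    z.append(zd)

for _ in range(200):                           # Gibbs sweeps
    for d, (words, labels) in enumerate(docs):
        lab = np.asarray(labels)
        for i, w in enumerate(words):
            k = z[d][i]
            n_kw[k, w] -= 1
            n_dk[d, k] -= 1
            # sampling distribution restricted to the document's label set
            p = (n_dk[d, lab] + alpha) * (n_kw[lab, w] + beta) \
                / (n_kw[lab].sum(axis=1) + V * beta)
            k = int(lab[rng.choice(len(lab), p=p / p.sum())])
            z[d][i] = k
            n_kw[k, w] += 1
            n_dk[d, k] += 1

for k, name in enumerate(topics):              # every topic has a readable name
    top = [vocab[w] for w in np.argsort(-n_kw[k])[:3]]
    print(name, top)
```

Because topic indices coincide with label indices, the clusters come out already named ("biology", "linguistics"), which is exactly the controllability the unsupervised model lacked.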
When I did some of the early work at Microsoft on text classification, we had some labeled data, or we learned classes, and one of the things the team we were working with wanted was the ability to say, "Sometimes I don't care what your model says; this item is in this class." So in many ways that harkens back to the same theme: combining some priors on the world with more inductive techniques. Now, perhaps the biggest change that has happened in natural language processing, and in lots of other areas, over the last decade or so has been the deep learning revolution, starting, I guess, around 2011, when Geoff Hinton and others showed how speech recognition, a well-trodden area, could be improved dramatically, and startlingly from my perspective, using large amounts of data, and similar techniques were then shown to work in vision a few years after. Those were both interesting low-level perceptual tasks, and one of the things that has happened over the last five or six years is looking at the extent to which those amazing accomplishments, driven by large amounts of data, large amounts of compute, and richer representations and models, apply to language. Language is very different in many ways: it's discrete, for one. I think machine translation, which you were involved in to some extent, was probably the first case where deep learning models influenced linguistic tasks; beyond that, progress has been a little slower. Can you say a little about your journey in thinking about deep learning in the context of language analysis and understanding?

Sure, yes. It's certainly right that the early huge wins of deep learning, and perhaps the domains to which deep learning is best suited, or at any rate the ones where it was seen to be applicable and explored first, were these perceptual tasks: speech recognition, object recognition in vision, and other kinds of signal analysis. That's a very natural space where you'd like a powerful associative device that can look at patterns, work things out, and transform them into some higher-level representation, and the general deep neural network story, of having many layers that increasingly abstract the input and extract a high-level signal, is very natural there. At first sight it indeed isn't so obvious that this is well suited to language processing, where you're working at a higher level: if you have text, you already have sequences of words, and languages have these symbols, words, which are already concepts. They're human-built abstractions of what are useful conceptual notions in the world, so this base level of neural network processing seems like it's not needed. But what people then started to explore is that for many purposes symbols aren't great. In particular, they don't have any natural notion of similarity: words and their meanings have a notion of similarity, but the symbols themselves give you no entree into it. That's something that had been explored a couple of decades earlier, and you were involved in that. Around 1990 there was a lot of work on whether you could convert words into a vector space representation, which was then being explored as latent semantic indexing, or latent semantic analysis, using linear algebra, singular value decomposition techniques, to produce a vector representation of word meaning. You can tell your own story, but I think it's fair to say that at the time it was slightly successful: there were some intriguingly positive results, but somehow it wasn't quite successful enough to displace traditional methods of information retrieval. Really, what happened around 2010 was that people returned to those ideas, but used newer, more flexible, and more powerful neural network methods to come up with vector representations of word meaning.
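The SVD-based approach just described can be sketched in a few lines of numpy: factor a small term-by-document count matrix and keep the top singular dimensions as dense word vectors, so that words sharing document contexts end up nearby. The tiny count matrix is invented for illustration.

```python
import numpy as np

# rows: "doctor", "physician", "banana"; columns: four toy documents
X = np.array([[2., 3., 0., 0.],
              [3., 1., 0., 0.],
              [0., 0., 4., 2.]])

# Singular value decomposition, keeping the top two dimensions
U, s, Vt = np.linalg.svd(X, full_matrices=False)
W = U[:, :2] * s[:2]                 # rank-2 dense word vectors

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

sim_doc_phys = cos(W[0], W[1])       # words with shared document contexts
sim_doc_banana = cos(W[0], W[2])     # words with disjoint contexts
print(float(sim_doc_phys), float(sim_doc_banana))
```

In a real collection the counts are huge and sparse, and the interesting effect is that truncating the SVD smooths related-but-rarely-co-occurring terms toward each other, which is the vocabulary-mismatch story discussed below.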
When those vector representations emerged in the early 2010s, there had been enough of a gap that it seemed like a new idea again, but it also immediately seemed a very powerful, successful idea. In the first half of the 2010s it seemed like you could take any well-worked natural language processing problem, add distributed vector representations of words, and simply the fact that they gave this powerful notion of word similarity would let you get an extra two percent performance on your problem.

Let me follow up on that. When we did LSI, it was really largely to overcome the vocabulary mismatch problem: if the two of us use different words to describe very much the same concept, like "doctor" and "physician," and they don't co-occur, you're not going to find the relevant objects. So we were very much concerned, in some sense, with making zeros in a term-by-document matrix non-zero, and that results in an embedding. I don't think we had broader linguistic notions than that, and the technique was particularly successful when you had short queries and short documents, not so much when matching a full document with hundreds of words against another document. But as you alluded to, I want to talk about your GloVe work with Pennington, in 2014. One of the things it did was bring together two perspectives. There was the linear algebra, or counting, perspective on word embedding, and there was a newer notion, introduced by Mikolov and others, of more predictive modeling: given some pre-training objective, can I learn a predictive model for language? What you did in GloVe was combine the two. Can you talk a little about doing that, and its successes?

Yeah, absolutely, and that was precisely our goal. On the one hand there had been this traditional work which calculated matrices, commonly word-in-document matrices, though there had also been a line of work, which I had actually been quite involved in, on word-by-word-context matrices. It seemed like that was the information you wanted in order to compute these distributed word representations, and in some sense the right way to do it. But instead what had happened was that people, most famously at that point Tomas Mikolov and colleagues with the word2vec model, had said: look, we can learn these word representations by coming up with a prediction task and repeating it a few billion times, and we get these good word vectors. In the original paper, not only was there no explicit formulation of the model as a model with a loss whatsoever, but everything was couched in this iterative-prediction language of coming up with word vectors. We just felt that was sort of wrong, or missing the point, because it seemed like there had to be a well-formulated model with a loss function, and that what this was computing should be an approximation to what you could compute directly by making use of a matrix of counts of words and contexts, and then building a model on top of that. So we were exploring how to do that, and I think showing that connection between prediction models and matrix factorization, producing in some sense a kind of generalization of the two, was the real contribution of the paper, viewed five or so years later. In practice we also produced a good set of word vectors that went alongside the paper, and we have thousands of citations mainly because lots of people have used our word vectors, but I think the intellectual contribution was combining those two perspectives.

Let me touch on both of those points. One of the things that happened in that paper, at least in my mind, is that you took a notion of both global context in the collection as well as local word context.
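The combination being described can be illustrated with a toy version of the GloVe objective: fit word and context vectors (plus biases) so that their dot products reproduce the log of global co-occurrence counts, with each pair weighted by a function f of its count. The 4x4 count matrix, learning rate, and weighting constants are invented for illustration; this is a sketch of the loss, not the released implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[10., 8., 1., 0.],     # invented global co-occurrence counts
              [ 8., 6., 0., 1.],
              [ 1., 0., 9., 7.],
              [ 0., 1., 7., 5.]])
V, d = X.shape[0], 2
W = rng.normal(scale=0.1, size=(V, d))       # word vectors
C = rng.normal(scale=0.1, size=(V, d))       # context vectors
bw, bc = np.zeros(V), np.zeros(V)

mask = X > 0                                 # zero counts are skipped entirely
f = np.where(mask, np.minimum(X / 10.0, 1.0) ** 0.75, 0.0)  # GloVe-style weighting
logX = np.where(mask, np.log(np.where(mask, X, 1.0)), 0.0)

lr = 0.05
for _ in range(5000):
    # weighted residual of the explicit loss: f(X) * (w.c + b - log X)
    resid = (W @ C.T + bw[:, None] + bc[None, :] - logX) * f
    gW, gC = resid @ C, resid.T @ W          # gradients of the squared loss
    gbw, gbc = resid.sum(axis=1), resid.sum(axis=0)
    W, C = W - lr * gW, C - lr * gC
    bw, bc = bw - lr * gbw, bc - lr * gbc

loss = (f * (W @ C.T + bw[:, None] + bc[None, :] - logX) ** 2).sum()
print(round(float(loss), 4))
```

The point of the formulation is exactly the connection discussed above: the same kind of word vectors that a prediction task learns iteratively can be fit directly against a matrix of counts, because there is an explicit loss.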
It seems to me that the notion of having some global constraint has been lost a little in some of the more recent work. Is that your sense as well, or do you see some of it carrying over?

Yeah, I do agree that you want both: the global context is good for the overall topic or theme, and the local context captures much more locally what is happening in a particular clause and the syntactic structure nearby.

And do you think some of the more modern large-scale word embeddings, things like BERT and GPT-2, have both of those perspectives? Or would you say the pre-training tasks tend to be more local in nature? Is that a fair characterization, or do you think some of them capture global perspectives?

Well, there are many different sizes of "global," but I think they actually are able to capture quite a lot of it, since both of these are transformer-based models, where the model for prediction uses an attention distribution over what it can look at. That attention distribution is capable of both putting a lot of attention on very nearby words, which gives local context, and spreading a kind of diffuse, averaging attention over a much bigger context. I actually think part of why those models are powerful is that they can do both of these things at once.

So the global constraints come from a larger span, rather than a larger collection?

Right. It's the larger effect not as global in the sense of a document collection, but as large context within an individual document. I think that can be captured.

Good. So I heard you give a keynote talk at SIGIR in 2016, where you claimed that the deep learning tsunami would hit information retrieval in 2017, and it certainly has. I haven't counted exactly, but my guess would be that more than half of the papers at recent SIGIR conferences have used deep learning techniques to try to improve retrieval.
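A toy numeric illustration of the attention point above: a single softmax attention distribution can put sharp weight on a local window while still spreading non-trivial mass diffusely over the rest of a long span. The score profile here is invented, not read off a trained transformer.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

n, q = 200, 100                          # sequence length, query position
dist = np.abs(np.arange(n) - q)          # distance of each token from the query
scores = np.where(dist <= 3, 4.0, 0.0)   # boost a small local window
attn = softmax(scores)

local_mass = attn[dist <= 3].sum()       # concentrated on 7 nearby tokens
diffuse_mass = attn[dist > 3].sum()      # still spread over the other 193
print(round(float(local_mass), 3), round(float(diffuse_mass), 3))
```

One head in one layer can mix both kinds of context this way; across many heads and layers, a model can trade the sharp window and the diffuse tail off against each other.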
In many ways, in that talk you argued for several things. One was that IR really needed deeper language understanding: that we were at a point where we'd gone beyond ten blue links. Ten blue links can cover up lots of inaccuracies, because you let people do the filling-in of the information, but more and more, people want to do question answering, and people want to work on mobile devices, where you don't have the opportunity to present a page full of links and let people pick what they want. How do you think that journey is going? How much language, or deeper language understanding, has been incorporated into retrieval models?

I think quite a bit, though it's still very much a journey in progress. If you're making search engine queries, there are clearly very many cases where you're getting no more than conventional information retrieval, and it's giving you an ordered set of links. But on the other hand, I think what the big search engine companies are doing is actually very clever, in that people aren't trying to say "here's a full, deep natural language processing solution," because the fact of the matter is that doesn't really work well enough yet, and there are all sorts of contexts where it isn't applicable; we see that in the limitations of things like conversational assistants. Instead, you can gradually introduce it where possible. If you can understand more of a query, you can see whether it seems to be a real question rather than just some stray keywords strung together, and if it is, recognize its structure. So there are now lots of places where the big search engines can pick up on structure: you can say "tenth President of the United States," and that's something a search engine can parse, understand, know the exact answer to, and give it to you, or you can ask for the husband of Beyoncé and get an answer back. So it's a bottom-up process of putting in natural language understanding where it works, and I think people are indeed more and more expecting that. Clearly the transformative use case has been mobile. Regardless of whether you're thumbing out the query or speaking it, even if you're typing, you've got this teeny screen, and you don't want to look at ten links and then start reading ten documents; you'd like the answer. Even more so, when people start speaking their queries, they're way more likely to be real sentences with questions, and what people would like is to get the answers to them. And I'll be honest, that's still a work in progress. I'm sure we've all tried, whether it's Alexa or Google Assistant or Siri, and half the time it doesn't work, but when it does it's actually pretty cool.

Yeah, I think you're a hundred percent right that people's ambitions have changed. Twenty years ago it was startling to be able to type in three words and get anything that made sense at all, and now we want to get specific information; we want, especially in mobile environments where you're speaking, to have a longer-standing conversation, not just a one-shot deal. So I think you're absolutely right: as people's needs become more refined, the underlying technologies needed to support them really change. That's tremendous progress.

And I think they'll change even faster in the coming generation. Little kids who grow up today expect all screens to be touchable and all devices to be able to be spoken to, and as that permeates, with those kids growing up, the pressure to actually deliver on that vision will be really strong.

I remember that at that SIGIR talk you said something like: so far we've seen the benefits of large-scale language analysis in information retrieval really
restricted largely to word-to-word similarity. You argued very strongly for the need to think a lot more about memory, compositionality, and really about inference and reasoning, and you've started on that journey. Can you say a little about some of the things you're excited about in supplementing simply knowing that "doctor" is related to "physician" with richer ways of doing reasoning?

Sure. Let me start off with the big picture, and maybe come back to information retrieval at the end. So there was this enormous success of neural network methods in the domains we touched on earlier, computer vision object recognition and speech recognition, and it was a little surprising that you could take those methods and move them to some things that seem like high-level tasks. Something like language translation has normally been thought of as a task that requires a lot of understanding of the world and of sentences, but it proved that once you had a lot of translated data, you could take a kind of signal processing approach to language translation that actually worked rather well. So a huge success of deep learning for natural language processing has been the building of neural machine translation models, which really only started to be developed around 2014, and within three years they were so obviously better than what preceded them that all the big tech companies moved to using neural networks for machine translation. That was impressive, but essentially all work in AI moved down to this lower-level signal processing because it was exciting and successful, and there are older, higher-level problems of AI, where you have knowledge and memory and you're doing inference and making plans, what Daniel Kahneman refers to as thinking slow, where you're doing more inferential processing. In the sixties and seventies that was the main thing people thought about in artificial intelligence, and not much of it was solved then, and not much of it has been solved lately, because to a large extent these problems were just ignored in favor of the other problems that seemed approachable and exciting. So the question at the moment is: can we take some of these successes in machine learning and neural networks and start applying them to these problems of higher-level cognition, to create the kind of flexible intelligence where you can take facts, compositionally combine them, reason over them, and apply them to new tasks? That's an area I've been interested in for the last few years and have started doing work in, seeing how we can try to build neural compositional reasoning engines, and trying to apply them to understanding tasks, some of it language understanding. I've also been doing quite a bit of work in visual question answering, combining vision and language, and I think that's an exciting area with interesting prospects. I couldn't honestly say that this is yet being applied back to information retrieval; eventually it should be, but there are hard challenges in information retrieval around getting the kind of scalability where you can run things on hundreds of millions or billions of documents. But we have certainly been starting to get results using neural networks to do multi-step reasoning processes, which seems more like a higher-level approach to cognition, and that seems an exciting direction to push.

I can certainly see how you can bring richer notions of memory, and maybe attention, into neural models, and you're starting to look, as you said, at multi-step inference and some notions of reasoning. How do you think about that architecturally? Should that be part of the network, or should the network contain things like
word similarities, and you combine those maybe in a fine-tuning process? Do you see one huge network doing everything that humans do, or do you see different core components, for example word similarity, perceptual similarity?

These are interesting questions, and I think they're honestly still quite open, and it's hard to know how things will pan out, but I'll try a few answers anyway. I believe ultimately that we have to have models of intelligence which have different components and different parts, and there is a certain level of modularity. And while there's still very much about human brains and neuroscience that isn't known, I think it's nevertheless completely clear that human brains also have different regions and have clear modularity. There's some repurposability; there are these intriguing experiments where you can rewire brains, and different parts of the brain can learn to do vision or can learn to do auditory processing. So there's probably a lot of flexibility and plasticity, but there are also a lot of brain regions which seem to have clear roles. And so at the moment there's sort of a tension. On the one hand, in the last few years the most successful way to make progress in deep learning models, I think in general, and certainly for natural language processing, is to say: I have one eventual goal, whether it's question answering, translation, text summarization, whatever you want to be doing, and what I want to do is have my eventual goal as the loss function, and I'm gonna feed text in one end, and I'm gonna build one huge end-to-end neural network, which may have some internal structure, but nevertheless I'm gonna optimize this whole network end to end, and that will maximize my performance and lead me to have this new high level of performance, and I'll be able to write a paper. And so end to end is very appealing, and in some sense it might be right, because we have our brains going around the world and things happen, we see stuff,
and somehow things get updated and changed, and maybe that is a kind of end-to-end process. But it also feels a lot of the time that that can't possibly be the right model. Human beings do all sorts of different tasks, and it seems like there have to be different parts of your understanding and reasoning and processing which are repurposed for all kinds of different tasks, and so it seems like there have to be these modular components that you can combine together in different ways for doing different tasks, and that therefore it's not all trained end to end, except perhaps in some kind of over-time, gradual, multi-task process. And so I think we have to be starting to move to a world that's more like that to produce sort of general intelligence, and even just general language understanding ability, rather than "here is a new system that can get a better score on this particular dataset than the ones that came out last year."

So I actually agree a hundred percent with you, and you made a comment, I think it was at the SIGIR conference, that you were worried about what you called, I think, the Kaggle-ization of machine learning. And it is interesting that the field is very driven by benchmark datasets, and we've both been involved in creating them. They're really useful for some things, but the way to succeed at those is to identify every bit of signal that is in one of those datasets and optimize it end to end. And I think one of the interesting things that has happened, at least in the last few years, is that for the newer models people evaluate not just on one benchmark but on many benchmarks, and so I think it is important that we start thinking about that kind of generalizability. And there may be more that we as a community can do. Those benchmarks usually come up because somebody wants to ask a particular question and there was no evaluation for it, but I don't think we have a great sense of what sample space of
competencies those operate over, and so that seems like an interesting direction moving forward, to think about test collections, which really drive things: the metrics drive performance and they drive innovation, but you also get what you measure, right?

So there are two sub-parts of that thinking in my head. One part of it is that, referring to things as the Kaggle-ization of machine learning, I have this fear that a lot of young people, students, certainly in terms of the education and experience they get through the school system (it might be different once they hit the real world), totally miss so much of the problem and the science that they should be looking at. Really what should be happening is that there are interesting unsolved problems in the world; you want to be understanding something of the nature of those problems, and then it might be useful to be constructing some dataset that captures some of what you can't yet do, and making something that you can do. And so there's an enormous number of steps there, and a lot of work in defining problems, defining what data you should collect, defining what would be good measures of success. A lot of the time students these days just don't see that, and are effectively discouraged from seeing that, because it's simultaneously both the easier thing to do and the easier way to get short-term success in the machine learning community to say: okay, well, here's this dataset that somebody else used, I've just downloaded it, it's got whatever's in it and whatever evaluation measure the last person used, and my only goal is to build a bigger, deeper neural network that'll score a higher number on that evaluation measure than the last person, and then I've got an easy path to my ICML or NeurIPS paper or something like that. And I think that's effectively a very bad education that we're giving to a lot of students at the moment, when that's kind of the viewpoint they have as to how to
go about things. So that's the sort of big-picture concern. And then, more narrowly, close to what you were asking at the end: a lot of the work of the last few years has been, well, here's this one dataset, okay, we're doing question answering, here's a dataset, maybe it's the SQuAD dataset from Stanford or MS MARCO from Microsoft, and my goal is to build, trained end to end, a system that performs better on that dataset than the previous system. And the real problem there is that we've had several years now of people gradually improving performance, so we have fantastic SQuAD question answering systems or MS MARCO answering systems, but these systems really aren't question answering systems, they're SQuAD systems, and they don't actually generalize very well, because they're very overfit to the particular characteristics of that one dataset. And it seems like we really need to be starting to push natural language processing, and also other related fields (it's the same story in computer vision, right), to actually try to build question answering systems. Well, the most obvious way to do that is at least building multi-domain systems, where we collect different question-answering datasets and say you should be building one question answering system that works well for all of those. But really we'd like to do even more than that; we'd like to have something that transfers well to fresh tasks. A lot of human intelligence is that we can take a fresh problem we haven't seen at all and say, you know, here's this new game to play; it's a different game, but it's not so dissimilar to tennis, you're hitting a ball with a racket, I've got some idea of how to do that. Oh look, this game squash, it doesn't actually have a net, so it's not so important that I get the ball over something in the middle, and oh look, I can bounce it off the back wall, that's interesting. So you can kind of extend your knowledge
by analogizing from what you have seen, while recognizing some of the new properties, and we need to start working out how to build systems that can do that.

And certainly, as you said, one way to start tackling that, and it's a community problem, right, is to not just work on individual benchmarks but to work on classes of problems, and I think we're seeing some of that. I think another, although it's hard to do from a corpus collection and creation perspective, is to build models that work from day to day to day. If you're running a web search engine, it's no good if your model works today but doesn't work tomorrow, so we need to build models that are robust to non-stationarity in document collections and query streams. It's not an easy challenge to address, but I think it's what is needed to move the field forward.

Yes, absolutely. And so multi-task evaluation is clearly something that's happening, right? Sam Bowman, my former student, now at NYU, has had quite a bit of success with his GLUE benchmark for natural language understanding, which has ten different natural language understanding tasks, and I guess he's got a second version now, SuperGLUE. And for question answering, there's the upcoming MRQA, Machine Reading for Question Answering, workshop; they've sort of put together ten question answering datasets there. So that direction is heading along successfully, and we're getting some generality there. I think the issue of non-stationary tasks and performance is a difficult one. Something that a lot of the world has been very interested in, and I think is becoming more interesting, is doing things with dialogue systems, right? Dialogue systems have been a very difficult area, and part of why it's a very difficult area is that it's just not amenable to these fixed datasets, because dialogues can take all of these different paths, and you can't say, well, here's some dialogue data, make a system that
works well on that data, because you have to be able to deal with fresh dialogues, and that's the nature of dialogue. Everyone wants their dialogues to be topical, right? At the moment people are wanting to talk about US-China trade friction, but next month hopefully they'll be talking about something else, and so you need to be able to change all the time. So how can you build these models, how can you optimize them, how can you evaluate them? I think we still don't have good answers to that. One answer is obviously reinforcement learning, and that's a field that has hugely expanded and is being applied a lot now, but it reflects the fact that it's that kind of problem. It seems like the only good way to evaluate dialogue systems at the moment is to be running them live and getting human impressions of whether things are working. In other areas of NLP, like question answering, the automatic evaluation measures may not be perfect, but they're definitely good enough that you can usefully experiment and build systems with them, where that just isn't true yet, I think, in areas like dialogue systems.

Good. Let me just wrap up with a question about advice that you'd give to students who are starting in the field. I think I've heard a little bit of it, which is to not just focus on existing datasets and optimizing the heck out of those, but to think more broadly about problems that are general tasks, or robustness of algorithms. But how do you set that up at Stanford?

Yeah, right. So at the end of the day, there are clearly different ways to be successful in research. Some people are on the sort of more mathy end of the spectrum, and if you are the kind of person that is going to think deep thoughts and prove theorems, or develop new algorithms, clearly there are pathways in that direction. But for my own work, and I think as advice for many people's work, working from a problem-oriented direction is just a
very successful way to work. In particular, I think there's been quite a bit of concern and discussion that in this current era of the late 2010s, where there are these huge companies which have ten times as much compute, maybe a hundred times as much compute, as any university has, how can a PhD student at a university compete? And I don't think the answer is that it's easy, but it gives some focus to what you have to do, because the wrong way to try and compete is to say, I'm going to build this big huge model that takes weeks to train on a hundred GPUs, and I'll be able to do that better than people at a huge tech company. Effectively you won't be able to, because you have fewer resources and you will do it more slowly. But what's in much scarcer supply is good ideas, and it's very easy to be successful doing things at a modest scale with good ideas. And good ideas can be of several kinds. You can notice new things in the world which provide new problems; there's lots of creativity in what's happening in the online world, and you can say, oh well, look, there are these new websites that teenagers know about but people of my age still don't know about, and if you look at this new website and the way it works, this is an interestingly different problem and opportunity where machine learning can be applied and something could be figured out and done. That could be a great problem. So there's finding fresh problems, and finding a fresh problem and presenting both the problem and initial models of it is a really good way to be successful and have a lasting impact. But beyond that, the other way to do it is not to say, oh, I can build a bigger, deeper transformer network than the last person and therefore be successful, but to think, no, there are a lot of things we still don't understand about how to build successful models. So although there's been some exploration of how to put longer-term memory into neural networks, it's very far from a solved area, and current systems really don't have much in the way of long-term memory. So what might be a good way to better put long-term memory into a neural network? If you can come up with a neat novel idea to do that and demonstrate it in the small, well then all the people at the big tech companies will, the year after the conference, try out the same idea at a ten times larger scale than you, but you'll get the credit for having come up with the original idea. So I think those kinds of pathways are the ways to success.

Okay, thanks. That's fabulous advice to the community and to forthcoming students. So Chris, thanks so much for meeting with us today.
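[Editor's note: the "attention read over an external memory" idea that comes up in the conversation can be sketched in a few lines. This is a generic toy illustration in numpy, not code from anyone's system; the function name, memory contents, and query are made up for the example.]

```python
import numpy as np

def attention_read(query, keys, values):
    """Soft read from an external memory: score each memory slot's key
    against the query, softmax the scores, and return the weighted
    average of the slots' values."""
    scores = keys @ query                    # one similarity score per slot
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ values                  # blend of memory contents

# Toy memory with three key-value slots; this query points at slot 0.
keys = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
values = np.array([[10.0], [20.0], [30.0]])
query = np.array([5.0, 0.0])
print(attention_read(query, keys, values))   # dominated by slot 0's value, 10.0
```

Because the read is a differentiable weighted average rather than a hard lookup, the whole thing can be trained by gradient descent, which is what makes this the standard building block for memory-augmented networks.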

As found on YouTube