Google I/O 2011: App Engine MapReduce

MIKE AIZATSKYI: Good afternoon, ladies and gentlemen. I'm really excited to see you here today. I'm Mike Aizatskyi from the App Engine engineering team. And today we would like to talk to you about App Engine and MapReduce. So the agenda of today's talk is to go briefly through the MapReduce Computational Model, because I assume some of you would like a reminder of what it is. Then we'll revisit the Mapper Library, which we released a year ago at the previous Google I/O.

Then I'll make a couple of announcements. I'm going to talk about some technical stuff to get you interested in how we did some things on the App Engine team. And throughout the whole presentation, I'm going to show you several examples and demos just to keep you entertained. So let's start with the MapReduce Computational Model. You might know that MapReduce was created at Google many years ago to solve the problem of efficient distributed computing.

And really, absolutely every team uses MapReduce for something. I'm absolutely sure that even the Android team uses MapReduce for doing some stuff, for doing some population statistics, whatever. And it's been a real success at Google, and we use it for everything. And this is a diagram of what it looks like. It might be a little bit confusing, so we'll go through it step by step. But it basically has three major steps, and those are called map, shuffle, and reduce. So the map step is some kind of user code which runs over input data. That input data might be a database, some files, whatever. And the goal of the code is to output key and value pairs. And the MapReduce framework should take those key-value pairs and store them somewhere for the other steps to consume. And as you see, the input is usually sharded, like split in parts, so that all of them can be processed in parallel really, really fast. And as I said, this is user code. The second step of MapReduce is the shuffle step.

And the shuffle phase has to collate values with the same key together. It has to go through all the values which were generated by the map phase. It should find the same keys, bundle those values together, and output key and list-of-values pairs. And this step has no user code whatsoever. This is a generic step of the MapReduce framework, usually provided as some kind of service, as some kind of magic black box which does the shuffle. And the third part is reduce.

And the reduce function will actually take the shuffle output, which is the key and list-of-values pairs, and it does something with that. It might compute something more interesting from that data. It might save it to a database. It might do something. And this is again user code. And again, its goal is to process the shuffle output as fast as possible, and it's good if it's parallel. So once again, this is the diagram of the MapReduce Computational Model, not the implementation, just the model. So we have the map step outputting key-value pairs, the shuffle bundling same keys together, and reduce processing the shuffle output.
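The three steps just described can be sketched in a few lines of plain Python. This is only an in-memory illustration of the computational model, not the App Engine library; the function names and the toy job are made up for illustration.

```python
from collections import defaultdict

def shuffle(pairs):
    """Collate values with the same key: (k, v) pairs -> (k, [v, ...])."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def run_mapreduce(map_fn, reduce_fn, inputs):
    """Run the model in memory: map over inputs, shuffle, reduce each group."""
    mapped = [pair for record in inputs for pair in map_fn(record)]
    return dict(reduce_fn(key, values) for key, values in shuffle(mapped))

# A toy job: sum word lengths grouped by the first letter of each word.
result = run_mapreduce(
    map_fn=lambda word: [(word[0], len(word))],
    reduce_fn=lambda key, values: (key, sum(values)),
    inputs=["map", "merge", "shuffle", "sort"],
)
# result == {'m': 8, 's': 11}
```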

So the way we work on App Engine, we usually take something which we already have at Google and which works for us, and we give it to you, to our developers. We bundle it somehow, and we give you access to that same set of technologies. And this is exactly what we wanted to do with MapReduce. But when we started working on that, we realized that there are lots of problems with this approach, unique to App Engine. First of all, App Engine has scaling dimensions which we usually don't have at Google. App Engine has lots and lots of applications, many more than Google has internally. And we expect that many of them will run MapReduce at the same time, and that might cause additional scalability problems. The other problem is isolation. We do not want anyone running something to kill a royal wedding website.

If you run MapReduce, you want the royal wedding site to stay up. So we want to completely isolate all applications from each other, but still we want them to have good performance. It might be surprising, but we also want to rate limit. Originally, MapReduce at Google was designed to go as fast as possible. It wants to process all the data in the shortest amount of time possible. And that means it can consume thousands of dollars of resources per minute. And not everyone is happy about that, because you might run a MapReduce and it will consume all your daily resources in 15 minutes. It will kill your online traffic, so you might want to rate limit it and say, OK, don't run in five minutes. Please run over two hours, because I have short-term quotas which I don't want to consume.

And taken to the extreme, there are even free apps which have a very, very limited amount of quota, and they still want to run MapReduces. And they might say, OK, run this for the whole week, but just get it done. But please, please, please do not kill my application. Running as slowly as possible is not a goal Google's MapReduce was ever designed for. We also have a protection issue, too, because there are lots of users out there. Some of them are malicious. Some of them are just not doing the right thing. And we want to protect everyone from each other. We want to protect Google from you. We want to protect you from each other, so that nobody steps on anyone else.

So when we analyzed all this, we realized that it's not a really good idea to take Google's MapReduce and give it to you as is. We realized that we needed to do lots and lots of additional work. And that's why, a year ago, we released the Mapper Library, which implemented the first step of MapReduce, the map phase. We released it at Google I/O 2010, in this building. Since then, it's been heavily used by developers everywhere, outside and inside Google. We even use it on the App Engine team. We use it in the admin console. We generate reports with it. We have a new index pipeline, which is in the works, which uses the Mapper Library.

And lots of people use it; there was recently a blog post about people using the Mapper Library, which is already useful. And during the whole year, we've been working on improving it, and we have quite a large set of improvements to the library. Let me go through some of them. We have the Control API, so that you can programmatically control your jobs, so that you can, for example, create a cron job which will run every day and start a Mapper. We have custom mutation pools, which are a way to batch some work within map function calls and process it more efficiently. We have namespaces support, so that you can run your Mapper over some fixed namespace, if you use namespaces, or over several namespaces. You can even run Mappers over the namespaces themselves, not the data inside them, so that you can analyze your namespaces. You can do some kind of migration, statistics, whatever. We have implemented better sharding with scatter indices, so we think sharding is now pretty good. And there are many more small improvements to the library. So if we take a look at this, and we ask the question, what should we do to take the Mapper Library to a MapReduce Library? The first thing is we need some kind of storage for intermediate data, because map jobs tend to output lots and lots of data in MapReduces.

And for that, we designed the Files API, which we'll go through in more detail later. And we released it in March 2011, just two months ago. We also had to implement the shuffler service. And the reducer is not a problem, because the reducer is actually a mapper; it just goes over the shuffle output. And of course, we had to write lots and lots of glue code. So today, I'm happy to say that we're launching the shuffler functionality. And this actually has two different pieces. First of all, we have a fully in-memory, user-space, task-driven, open source shuffler for kind of small data sets. And by small, I mean hundreds of megabytes, maybe gigabytes, maybe even more. Who knows? This is completely open source for Python. Java comes soon. It will be open source, too. We also announced that we want to start trusted tester access to the big shuffler.

So if you have some kind of data set which you need to run MapReduce on, and you have hundreds of gigabytes of data, just get in touch with us and we will see how it can work for you. And we have also, of course, written all the integration pieces which are needed to run your MapReduce jobs. And they are part of the Mapper Library, which we are now deciding to call the MapReduce Library. It's not a Mapper library anymore. We'll go into the technical details later. But first, let's go through some simple examples of what MapReduces look like.

And the first really simple example, which is the "hello world" of MapReduce, is the word count MapReduce. And the goal of this MapReduce is to take some set of text and calculate how many times each word appears in the text, basically word distribution statistics. And the map function is really simple. It's almost two lines, even if you throw in some additional declarations. Basically you split the text into words, and you yield the word as the key and an empty string as the value. We're not using the values here, per se. We're just going to use them to count. So if you take the famous quote from Pulp Fiction, then for five words, we output five key-value pairs.

Then the shuffler will take this data, process it, and bundle all values for the same keys together. So we'll have Zed's, which is going to have two values from the input, which are empty strings. Dead, also two. And baby is going to have only one. And the reduce phase just ignores the values. It just counts how many of them there are. So it's going to output: Zed's, two times. Dead, two times. And baby, one time. So let's see how this works. Is it large enough? So unfortunately, this stuff runs for several minutes on the data set that I chose, so I decided not to run it live. I took 90-something books from the Gutenberg Project, which amounts to tens of megabytes of text. And I ran this MapReduce. It took six minutes. And you can see that this is our new UI for multi-stage processes. We're going to talk about this a bit later. But you can see that the whole word count MapReduce took, like, six minutes, 40 seconds.
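The word count job just described might look like this minimal in-memory sketch. This is plain Python, not the actual MapReduce Library API; the shuffle step is simulated here with a dictionary.

```python
from collections import defaultdict

def word_count_map(text):
    # Emit (word, "") for every word; the value is unused, we only count keys.
    for word in text.split():
        yield (word, "")

def word_count_reduce(key, values):
    # The shuffle has bundled all values for one key; just count them.
    return (key, len(values))

# Shuffle: collate values with the same key together.
groups = defaultdict(list)
for key, value in word_count_map("Zed's dead baby Zed's dead"):
    groups[key].append(value)

counts = dict(word_count_reduce(k, vs) for k, vs in groups.items())
# counts == {"Zed's": 2, 'dead': 2, 'baby': 1}
```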

The map phase took one minute, 40 seconds. Shuffle took three minutes. And reduce took one minute, 50 seconds. And there were some additional steps, like cleanup and some statistics steps. And if you take the reduce job, which finished in a minute, here it is, one minute, 50 seconds, we can see that it read 78 gigabytes of intermediate data, which is quite good. I'm told that the shuffler is supposed to process hundreds of megabytes, but we're actually able to run it over 78 gigabytes without big problems. The output? One gigabyte of data. And the total input size was just 50 megabytes of text, books. This is what its output looks like. So we have all the words, in some order. They're kind of sorted, but we'll talk about this later. But we can see that "captivity", or, I don't know what it is, is actually present only once. So if we use some kind of grep with this word, we can see that it actually occurs only once, in this one book.

And yeah, it works. So this is the word count MapReduce. And now let's improve our word count MapReduce to build the simplest possible really useful MapReduce, which is a building block even for google.com. And this MapReduce is called building an inverted index, which means that we want to build a huge map which allows you to look up which documents contain a given word. And you can actually say that this is how Google search works. So if you have the internet on your CD-ROM and you upload it to App Engine, then you can build a Google-like search index. So we use the same stuff, but instead of outputting empty strings, we'll output the file name, or document name, or something which identifies where this word comes from.

And the only thing we do in reduce is remove duplicate values. We just build the list: this word occurs in this document, this document, and this document. Of course, instead of just deduplicating documents, you could calculate how many times the word occurs in each document, so later you could compute some kind of ranking, whatever. But this is the simplest one. Again, I'm going to show you the run. It actually ran a little bit slower for some reason. But yeah, it took six minutes, 50 seconds. If you take a look at the size of the intermediate data, here it is.
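The change from word count to an inverted index is small enough to sketch directly. Again this is an in-memory illustration in plain Python, not the library API; the document names and texts are made up.

```python
from collections import defaultdict

def index_map(doc_name, text):
    # Emit (word, document name) instead of (word, empty string).
    for word in text.split():
        yield (word, doc_name)

def index_reduce(word, doc_names):
    # Remove duplicate values: keep each document only once per word.
    return (word, sorted(set(doc_names)))

docs = {"1059.txt": "the aviators flew", "3061.txt": "aviators and the sea"}

# Simulated shuffle: collate document names under each word.
groups = defaultdict(list)
for name, text in docs.items():
    for word, doc in index_map(name, text):
        groups[word].append(doc)

index = dict(index_reduce(w, ds) for w, ds in groups.items())
# index["aviators"] == ["1059.txt", "3061.txt"]
```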

We can see that we actually processed 170 gigabytes of intermediate data. And the whole MapReduce took six minutes, which we think is actually quite good. So here is what the output looks like for the whole 60 megabytes of books from the Gutenberg Project. So now you can take this output, and you can actually build a web app which has some kind of search on it. So let's verify that it works. Let's take a look at "aviators". Here it is. You can actually see that there are only two documents which have this word, 1059 and 3061, which is true. So I'm quite confident that it works. So these were the two most common and simplest MapReduces you could write. Later in this talk, I'll show you a little bit more complicated and interesting MapReduce. But now I would like to dive into the technical bits of our solutions.

I'd like to talk about how we did what we did. And actually I would like to show you that you could build something similar, or modify it to your needs, once you know how it works. And I'm going to talk about two things. I'm going to talk about the Files API, which was our solution to the MapReduce storage problem. And I'm going to talk about the User-Space Shuffler: how we built it, which algorithm we used, and so on and so forth. So, the Files API. As I told you, and as you have seen, MapReduce jobs generate lots of intermediate data. This was a tiny, simple MapReduce, which is just an example, and processes just 60 megabytes of input data. And it generated 170 gigabytes of intermediate data. And of course, you have to store it somewhere. So before the Files API, we had three storage systems. We had the Datastore, which is kind of expensive because it's replicated over multiple data centers, lots of copies.

And it also has a one-megabyte entity limit, which makes storing 170 gigabytes of data kind of hard. You have to split it. You have to remember where you put each chunk. You have to read it in order. It's not a pleasant task. We also had the blobstore, which is really good, but it's read-only. The only way you could get data in was to upload files through an HTTP form, which does not work for MapReduce. And you had Memcache. Memcache is kind of good. It's really fast, but it's small and it's not reliable. You can never be sure that the data you put in is still there, because there might be memory pressure or something else. So that's why we designed the Files API. We wanted to give you a familiar, file-like interface to some kind of virtual file systems. These are not local files. These are virtual file systems. They just look like local files.

If you have ever accessed local files, you will find it familiar. We released it in 1.4.3. We integrated it with the Mapper Library in the way I'm going to show you. And we consider it to be a low-level API. It's not something that every user, we think, is going to need. Most users will probably just use the MapReduce Library. But you might find some interesting uses for it, too. And it's also still an experimental API. We're still looking at how it works. We're still trying to understand what you need, what we need. And I said that the API is really similar to local file systems, but it has some huge differences. Basically, the biggest difference here is that files have two states, writable and readable. In the writable state, you can write to the file but you cannot read it. In the readable state, you can read but you cannot write to it. And all files start writable. Later, when you're done writing, you finalize the file, and it transfers to the readable state. There is no way out of the readable state. Once it's finalized, it's forever.

All writes are append-only, so there is no random access on write. All writes are atomic, meaning that if you output 100 bytes, nothing is going to get in between them. And they are fully serializable between concurrent clients, meaning that you might have lots of background tasks writing into the same file at the same time, and we're going to make sure that their data does not interleave. Also, some concrete file systems might have additional APIs for dealing with their own properties.
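The writable/readable lifecycle described above can be pictured as a tiny one-way state machine. This is a pure-Python illustration of the semantics, not the actual Files API.

```python
class TwoStateFile:
    """Illustration of the writable -> readable lifecycle: append-only
    writes while writable, reads only after a one-way finalize."""

    def __init__(self):
        self.state = "writable"
        self._chunks = []

    def append(self, data):
        if self.state != "writable":
            raise IOError("file is finalized; no more writes")
        self._chunks.append(data)  # append-only; one call, one atomic write

    def finalize(self):
        self.state = "readable"  # one-way transition: no way back

    def read(self):
        if self.state != "readable":
            raise IOError("file must be finalized before reading")
        return "".join(self._chunks)

f = TwoStateFile()
f.append("hello ")
f.append("world")
f.finalize()
assert f.read() == "hello world"
```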

And every file system is going to have its own reliability and availability constraints and guarantees. The Files API itself doesn't know about that. In March, we released the first file system. This is the blobstore file system. We actually took the read-only blobstore facility and made it writable. So you can now write directly to blobstore. You can create really huge files. There is no limitation aside from 64-bit lengths, which I think is going to be enough for the next thousand years. Once you finalize a file, it is fully durable. It's going to be replicated across multiple data centers. And it's going to have the same reliability guarantees as the high-replication Datastore. But the important point here is that writable files are not durable, meaning that if something bad happens, for example a Datastore failure, or a maintenance period, you might lose your files while they're in the writable state. This is intentional: while you write, we don't replicate, so that [UNINTELLIGIBLE] is good. Because we think that MapReduce is going to be the primary consumer of this API, and if something bad happens, you just restart the MapReduce job.

And since this is blobstore, you can fetch a blob key for a finalized file and use the familiar blobstore API. So if you want to download that file or serve it to the user, you can get the blob key and serve it directly without writing too much code. This way you can, I don't know, write code which generates some kind of movie. You just write to the blobstore frame by frame, then you finalize it, and you can serve it. This is a simple Python example of how you could access it; Java has its own API. And yeah, we don't have much time to go through all of it, but basically it looks natural in the language. It's mostly like the file objects you have in Python. So first of all, you create the file in blobstore. And you specifically say, OK, I want a file from blobstore.

Then you open it for writing. You write data. And once you're done, you say, OK, finalize the file. Reading is the same stuff. You can open the file. You can read from it. You can seek in the file. Writing is append-only, and reading has random access. So you can do lots of crazy, interesting things. And if you want the blob key, you just call the get-blob-key function, and you're going to get a blob key for the file. As I said, we integrated with the Mapper Library. We added output writer support. So there's one line in the MapReduce definition, it's [UNINTELLIGIBLE], which just specifies, OK, I want the output of this mapper function to go directly to blobstore. This way, with a single-line map function which just converts an entity to a CSV line, or XML, whatever, you can easily write a mapper which will export your data into blobstore. And since you have the Control API, you can actually create a cron job and make a copy of your data every day, or every week, or run it whenever you need. We also plan to migrate the bulk downloader and uploader to this functionality.

We're not there yet, but it will eventually come. But meanwhile, it's now easy to get your data in some sort of downloadable format. And we also have some low-level features, on which we're not going to spend a lot of time, but basically we have exclusive locks. You can lock files. You can make sure that no one else accesses them. And we have sequence keys, so that if you need some kind of guarantee against writing the same data twice, because a request might be retried or might come a second time from the user.

There is a notion of a sequence key, and the API will take care of it for you. If you need this, read the API documentation and source code. And in the future, we're going to introduce more file systems. The first one which we really, really want to get out, we call the Tempfile system. This is going to be much faster and much cheaper, because it's not going to be replicated anywhere. But it's not going to be durable at all, not even finalized files. It might lose those if something bad happens. And each file is going to be stored only for several days, because we want this to be a MapReduce backing file system. We want it to be really fast. We want you to be able to store terabytes there without selling your house. Another thing is that we want to integrate with other storage technologies which we have at Google. So just keep watching what we're doing; we're going to have lots of exciting file systems coming. So this was the Files API. And now let's talk about the bread and butter of our MapReduce implementation, which is the User-Space Shuffler.

So as I told you, we're going to have two shufflers: the User-Space Shuffler, which is kind of simple, open source, and has its limits on the data it can process; and a huge shuffler service, which is hidden behind just an API call which says, OK, shuffle this for me, and which is going to process lots and lots of data really, really quickly. But we wanted to have the User-Space Shuffler for small and medium jobs. And as you remember, the job of the shuffler is to consolidate values for the same key together: go over all the data, find the same keys, and bundle their values together. And even though this User-Space Shuffler is full source code, with no new App Engine components, anyone could have built something like it, we still want it to be reasonably fast, reasonably scalable, and reasonably efficient. Because we just saw that, yeah, 170 gigabytes is not a big deal. It's just a couple of minutes, which is good. So the way we're going to arrive at the shuffler implementation is we're going to start with a really, really stupid shuffler.

We're going to improve it step by step until we get to the final algorithm. And the first, naive shuffler just loads everything into memory, sorts it, and reads the sorted array. It's going to look like this. Once you have your values sorted by key, you can just read and see: OK, this is the same key, grab this value. Same key, grab this value. OK, this is another key. Here are your values. It's really just a couple of lines of code, but unfortunately, it has lots of problems. First of all, it's a memory-bound algorithm, bound by how much data you can fit in memory. If you have a separate workstation, like a server, you might be lucky to have 64 gigabytes. And there is no way to increase performance by throwing more resources at it, because it's single-threaded. It's just one sort; no way out of it. So we're going to improve this one by splitting our input data into chunks, and storing the data in the Files API.

And then each chunk will be sorted on its own. Those chunks should be of some limited size, because when you're writing in App Engine, in Python, it's not really memory-efficient; our chunk size is, like, five megs. So we're going to sort each one of those, store it, and then we're going to merge-read all the sorted chunks together using an external merge. As the MapReduce Library does it, we're just going to merge-read from all of them, without another sort. This is really simple. Your upper two blobs go to the first chunk, the lower three to the second chunk. And then you're going to merge-read by maintaining a pointer into each one of them.

You read from all of them. You see, OK, blue, blue, blue, blue. Then you advance each pointer. And yeah, it's going to work. The properties of this approach: first of all, it is no longer memory-bound. You can sort as many chunks as you want, as long as each one of them fits in memory. Another good thing is that the sorting is parallel.

You can throw in a thousand computers, give each one a chunk, and they're going to finish in parallel. But unfortunately, the merge phase is not parallel. It's just one process on a single computer, which has to merge-read all of them. And it is kind of difficult and slow to merge-read from too many files. You could have tens of thousands of files, and you'd need some kind of heap structure to maintain pointers into all of them.
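Merge-reading sorted chunks with a heap of per-chunk pointers is exactly what Python's standard `heapq.merge` implements, so the step can be sketched like this (the chunk contents are made up for illustration):

```python
import heapq
from itertools import groupby

# Each chunk was sorted on its own (by key). heapq.merge keeps one pointer
# per chunk in a heap and always yields the smallest head: a merge-read.
chunk_a = [("apple", 1), ("dead", 1), ("zed", 1)]
chunk_b = [("baby", 1), ("dead", 1), ("zed", 1)]

merged = heapq.merge(chunk_a, chunk_b, key=lambda kv: kv[0])

# Because the merged stream is sorted by key, equal keys are adjacent,
# so bundling their values is a single linear pass.
shuffled = [(key, [v for _, v in group])
            for key, group in groupby(merged, key=lambda kv: kv[0])]
# shuffled == [('apple', [1]), ('baby', [1]), ('dead', [1, 1]), ('zed', [1, 1])]
```

With many thousands of chunks this single merge-read becomes the bottleneck, which is exactly the problem the hash-partitioning step solves next.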

Yeah, it gets really, really slow once you crank up the number of files. So we're going to improve this, too. And the way we're going to improve it is by using the hash code of the key. Really, if we separate the keys whose hash codes are divisible by two from those that aren't, then we can process those two sets of keys completely separately. All even keys go into one chunk. All odd keys go into another chunk. And we can run the same algorithm from the previous step on each of those sets of chunks. We don't have to worry about merge-reading across them, because each one has a disjoint set of hash keys. This is how it's going to look. It looks quite nice, you know. And then I needed a fourth color. But basically, once again, we take each blob output from the Mapper. We separate it into chunks by hash code. Like, blue and red go to the top, green and yellow go to the bottom.

Then we sort the data from each hash chunk. So we sort some blues and reds. We sort some greens and yellows. And then we're going to do a merge-read from only those chunks. And this approach is actually quite good. Because by using the hash code, you can split your output data into as many hash-code-based chunks as you want. You want thousands? You got it.

You want 10,000? You got it. So you can make the merge as parallel as possible. You can run thousands of merge-reads in parallel. All sorting is in parallel. There is nothing memory-bound here. If you have too much data, just increase the number of hash chunks and process them separately. And this is actually the shuffler which we released today for Python. And the Java version is going to come really, really soon. As far as I know, someone is already working on that. Yeah, this was the shuffler. This is the way it works. Of course, the code is more complicated than that, because we have to deal with lots of tiny details. But this is the idea. And we soon realized that our MapReduce consists of several sequential processes. You have to finish map before you start shuffle. And you have to finish shuffle before you start reduce. So we designed a new API to chain complex work together, and we called it the Pipeline API.
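The hash-partitioning idea can be condensed into a short sketch. This is an in-memory illustration of the algorithm, not the released shuffler; `hash_shuffle` and the partition count are made-up names for this example.

```python
from collections import defaultdict

def hash_shuffle(mapper_outputs, num_partitions=4):
    """User-space shuffle sketch: partition pairs by hash of the key, then
    sort and group each partition independently. Because every key lands in
    exactly one partition, partitions can be processed in parallel with no
    cross-partition merge."""
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in mapper_outputs:
        partitions[hash(key) % num_partitions][key].append(value)

    # Each partition holds a disjoint set of keys; sort within it only.
    result = []
    for part in partitions:
        for key in sorted(part):
            result.append((key, part[key]))
    return result

pairs = [("zed", ""), ("dead", ""), ("baby", ""), ("zed", ""), ("dead", "")]
shuffled = hash_shuffle(pairs)
# Every key appears exactly once, with all of its values bundled together.
```

In a real run, each partition would be sorted and merge-read by its own task, which is why cranking up `num_partitions` is how you handle more data.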

And this is actually the glue which holds together the Mapper, Shuffler, and Reducer, and which makes them a single MapReduce job. And our MapReduce Library is now fully integrated with Pipeline. And that UI which you saw, with the multiple stages, actually comes from the Pipeline API. It's an API which takes care of all those details, like, OK, after you're done with reduce, you have to run cleanup jobs to clean up all those intermediate files.

This is the API which enables you to do that. And if you're interested in this API, I want to recommend the talk of my colleague, Brett Slatkin. It's called "Large Scale Data Analysis Using the App Engine Pipeline API." I think it's at 4:00 something. So yeah, please come if you're interested in running your own complex processes. And now, I promised you a more complicated, more involved example of a MapReduce. And the example we're going to write is what I call distinguishing phrases. So the idea is to take a look at multiple books and figure out if some phrase has been used in one particular book a lot, but not used in the other books. Like, if some authors have their own signature phrases, or some books have signature phrases. The way we're going to do it is we're going to split the text into n-grams. An n-gram is a sequence of n consecutive words. And you can pick the n which you like; I ran the examples for n equals 4. The problem here is, if you increase n a lot, like if you use 10-grams, then the 10-grams are probably all unique, because these are complete sentences already.

And they're quite unique. And if you use n equals 2, that's not even a phrase. That's just two words. So I picked 4. In reduce, the first thing we're going to do is check if the phrase is actually frequent. We're not interested in something the author used only once. So we check if it occurred at least 10 times. Then we're going to check if a single book has this phrase more times than every other book combined, which means that it is really unique for this book, because it's more than everyone else. And we're going to run this on the same data set, which is 90 books.
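The whole distinguishing-phrases job can be sketched in plain Python, ignoring the MapReduce plumbing. The function names, book names, and the small thresholds in the usage below are illustrative; the talk used n = 4 and a minimum of 10 occurrences.

```python
from collections import Counter

def ngrams(text, n=4):
    # Split text into sequences of n consecutive words.
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def distinguishing_phrases(books, n=4, min_count=10):
    """A phrase distinguishes a book if it is frequent overall AND that one
    book uses it more times than every other book combined."""
    per_book = {name: Counter(ngrams(text, n)) for name, text in books.items()}
    totals = Counter()
    for counts in per_book.values():
        totals.update(counts)

    found = []
    for name, counts in per_book.items():
        for phrase, count in counts.items():
            others = totals[phrase] - count
            if totals[phrase] >= min_count and count > others:
                found.append((phrase, name))
    return found

books = {
    "ricardo.txt": "money price of corn " * 6,
    "other.txt": "money price of corn indeed",
}
result = distinguishing_phrases(books, min_count=5)
# ("money price of corn", "ricardo.txt") is in result: that book uses the
# phrase 6 times, more than the 1 occurrence in all other books combined.
```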

Let me show you how it went. So it took, like, seven minutes or something. This is, as I told you, the UI which comes from the Pipeline library. So we have map, shuffle, reduce. If you open the shuffle, you can see that there are lots of sort pipelines. And each sort pipeline sorts one hash-based bucket, the way I explained to you.

And yeah, it finished in, like, seven minutes, 25 seconds. Just to have an idea of whether these seven minutes are good or not, I wrote a simple Python program which uses the same MapReduce Computational Model, but completely in memory on a single computer, and in one thread, because it's Python. And on my computer, it took almost 20 minutes to complete. So this is already faster than running it on one computer. And this actually scales, so if you increase the data set, you can just increase the number of shards, and you're not supposed to see a huge time increase. And when I was creating this demo, I wasn't even sure that this was going to work, but it turns out there are lots and lots of unique phrases in books.

And this is the output. There are lots and lots of phrases here which you could use. Here it is: "money price of corn". Let's just check. Oh, this is not fair. This is just one book which always talks about the money price of corn. Let's pick something else. This is Spanish. "These are the same." It's surprising that this is unique, right? Yeah, so you can see that there were three books which use the phrase only once, "these are the same" something, and there is one book which used it seven times. And I'm just curious, what is this? This is an old book of Leonardo da Vinci. So you can say that in this book set, the use of this phrase kind of uniquely identifies Leonardo da Vinci. If you have a huge corpus of text, I guess it might become really interesting. And this was all, if you throw in all the details which I omitted here, like counting occurrences and helper functions, like 50 lines of code. It's not a big deal, and it's really something interesting.

So this was all I wanted to talk to you about today, and let me summarize. Starting today, I did a push a couple of hours ago into [UNINTELLIGIBLE], you can grab the source code, and you can take a look at the examples. The documentation is coming soon. And you can run small, medium, and, who knows how big they can be, these User-Space Shufflers. We haven't actually pushed the limit yet. You can run those MapReduces today in Python. Java is going to come really soon. If you're interested in helping us get the Java implementation done, just write us an email and we'll tell you where we need help. And I know there are already people helping us. Right? Yeah. And if you're interested in large MapReduces, maybe we should call them huge MapReduces, get in touch with us. We'll give you the API to access the full-blown shuffler for terabytes of data. So now I'm ready to answer any questions you might have. Thanks.

AUDIENCE: It looks like magic, but, you know, I didn't quite understand the shuffle part and the merge portion.

You know, as a framework, it looks very complete. But when you run it on Google App Engine, you have the task queues which allow you to do map. But as far as the merge and the shuffling are concerned, if you run them in parallel as different execution threads, you need to have a common place where you can do the merge. It could be the Datastore, or it could be memory. If it's the Datastore, the current BigTable doesn't allow you to do a Select For Update. Which means if you try to update the same entity from two different threads, right, so when you do an update, and when you read something else, it gives you dirty data, right?

MIKE AIZATSKYI: So basically, yes. The Pipeline API gives you some kind of synchronization point. It takes care of all of that by itself. It knows all the limitations of the App Engine platform. It knows that it stores data in the Datastore. It knows that it's bad to access the same entity from multiple requests at the same time. And the Pipeline API hides all that complexity from you. You just define a job as a simple Python function.

And you say, OK, run it after this one completes, and pass all the output here. And it's going to take care of that for you.

AUDIENCE: OK. Right. So where do we get– because we are trying to do this for aggregate queries, and we're not able to figure out a way within the–

MIKE AIZATSKYI: It doesn't use any kind of aggregate queries. It uses key names in a really interesting fashion to achieve that.

AUDIENCE: Right. But aggregating information is the objective of merging to a key name, right?

MIKE AIZATSKYI: Yeah.

AUDIENCE: For a use case like aggregate queries.

MIKE AIZATSKYI: Yeah, but just take a look at the source code. It's all open source. Anyone else want to be on tape? So then we're done. Thank you a lot for coming. If you have any questions which you want to ask me off the record, feel free to come to me.

Thanks a lot.
