Hi HN, author here. Corral is my attempt at a performant, easy-to-deploy MapRedu...

Cthulhu_ · on May 7, 2018

I'd like to see your readme expanded with some figures:

* Processing speed - that is, how long does it take to do that word count example on a nontrivial dataset? Something that takes hours on a local machine, vs minutes in map/reduce. Comparing local to this to e.g. Hadoop or Google BigQuery or whatever viable alternative there is.

* Cost. I think that's probably the biggest factor here. I don't get the impression that Lambda was intended for big data or highly resource / i/o / processing intensive operations, but, I'd love to be proven wrong.

* Actually, mostly just cost vs performance.

I mean it's a neat idea but if the serverless benefit is outweighed by difficulty in setting up, cost, performance, etc compared to dedicated big data solutions it's going to stay a proof of concept.

quickben · on May 7, 2018

Hi,

For a small MapReduce load, say a terabyte (to replace a single MR node), how much would you estimate the AWS cost would be?

bcongdon · on May 7, 2018

Pricing depends a lot on how much memory your job requires[1] and how much processing each record requires -- i.e. the pricing is more sensitive to usage.

As a very rough estimate, for a light-to-medium load of 1 TB, the cost would probably be in the ballpark of $0.50. AWS's own reference MR framework[2] (which is mostly a tech demo) quotes prices in a similar order of magnitude.

Corral isn't great for processing-heavy MR jobs, as Lambda pricing rises quickly if you need a lot of memory or take a lot of time with each record. But, for small-ish low-overhead jobs, it can pretty easily beat the pricing and hassle of using something like EMR.

[1]: https://aws.amazon.com/lambda/pricing/#Lambda_pricing_detail...

[2]: https://github.com/awslabs/lambda-refarch-mapreduce/
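To make the shape of that estimate concrete, here is a rough back-of-the-envelope sketch. It is not how Corral or AWS computes anything; the function name, the 100 MB chunk size, the 5 seconds per chunk, and the 512 MB memory setting are all illustrative assumptions. The two price constants are the Lambda list prices as I understand them at the time of this thread, so check the pricing page before relying on them. It covers only Lambda compute and request charges for a single pass over the data, and ignores S3 storage and request costs and the free tier.

```python
import math

# Assumed Lambda list prices (2018); verify against the AWS pricing page.
PRICE_PER_GB_SECOND = 0.00001667
PRICE_PER_REQUEST = 0.20 / 1_000_000


def estimate_lambda_cost(total_bytes, chunk_bytes, seconds_per_chunk, memory_mb):
    """Return (invocations, dollars) for one pass over the input.

    All four parameters are hypothetical workload knobs, not Corral settings.
    """
    # One invocation per chunk of input.
    invocations = math.ceil(total_bytes / chunk_bytes)
    # Lambda bills compute as (memory in GB) x (execution seconds).
    gb_seconds = invocations * seconds_per_chunk * (memory_mb / 1024)
    compute_cost = gb_seconds * PRICE_PER_GB_SECOND
    request_cost = invocations * PRICE_PER_REQUEST
    return invocations, compute_cost + request_cost


invocations, cost = estimate_lambda_cost(
    total_bytes=10**12,          # 1 TB of input
    chunk_bytes=100 * 10**6,     # assumed 100 MB per invocation
    seconds_per_chunk=5,         # assumed light per-record work
    memory_mb=512,               # assumed function memory size
)
print(f"{invocations} invocations, about ${cost:.2f}")
```

Under these assumptions the compute term dominates and scales linearly with both memory and time per chunk, which is why heavier jobs get expensive quickly.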

shaunray · on May 8, 2018

Hi OP, I am the person who made the most recent changes to the AWS Labs refarch. I had been working on a Go version and wanted to clean up the Python one. Sunil, the original author, used the AMPLab benchmark to calculate the results table. I was planning on updating it with the 1- and 5-node tests. Would be happy to include Corral as well.

eranation · on May 7, 2018

Nice! Always thought this would be cool. A few questions, though: how do you get things like consistent hashing? Spark, for example, can shuffle data somewhat efficiently by sending the data to the right node / getting data from the right node by the hash key, right? How, in a serverless, stateless world, do you call a specific serverless function instance? Assuming it can’t be done, aren’t you losing the performance gained by data locality? E.g., does the data have to be saved in a massive and efficient key-value store? Isn’t that much slower than Spark’s in-memory / data locality approach (bring the compute to the data and not vice versa)? Would love to see benchmarks on this. This is the future IMHO... well done.

optimuspaul · on May 7, 2018

Nice! I've been thinking about doing this kind of thing for a while. I have long experimented with doing map reduce style work outside of things like Hadoop and gotten much better results due to being able to tune for different things much more quickly.

navaati · on May 7, 2018

Hi.

How do you deal with the 5 min (IIRC) execution time limit of Lambda?

bcongdon · on May 7, 2018

Yeah, max execution time (and max memory usage) are the main constraints of using Lambda.

Corral deals with this by splitting input data into small enough chunks that each chunk can be processed within the timeout -- I exposed options for setting the amount of data that each Lambda function has to process. However, if each data item requires more than 5 min of processing, then Corral won't work for you.

The "driver" that coordinates the Lambda functions runs locally (not in Lambda), so it doesn't have this constraint.
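The chunking idea can be sketched roughly as follows. This is a minimal illustration of the general technique, not Corral's actual code or API: `Split`, `split_inputs`, `pack_into_bins`, the size parameters, and the example bucket paths are all hypothetical names chosen for this example. The driver cuts each input file into byte ranges no larger than a chosen split size, then groups consecutive ranges into bins so that each function invocation gets a bounded amount of data. A real implementation also has to deal with records that straddle a range boundary, which is left out here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Split:
    path: str
    start: int  # inclusive byte offset
    end: int    # exclusive byte offset


def split_inputs(file_sizes, max_split_bytes):
    """Cut each file into byte ranges of at most max_split_bytes."""
    splits = []
    for path, size in file_sizes.items():
        for start in range(0, size, max_split_bytes):
            splits.append(Split(path, start, min(start + max_split_bytes, size)))
    return splits


def pack_into_bins(splits, max_bin_bytes):
    """Group consecutive splits so each bin holds at most max_bin_bytes.

    Each bin is the work handed to one function invocation.
    """
    bins, current, current_bytes = [], [], 0
    for s in splits:
        length = s.end - s.start
        # Start a new bin when adding this split would exceed the budget.
        if current and current_bytes + length > max_bin_bytes:
            bins.append(current)
            current, current_bytes = [], 0
        current.append(s)
        current_bytes += length
    if current:
        bins.append(current)
    return bins


# Hypothetical inputs: object sizes in bytes, kept tiny for readability.
files = {"s3://example-bucket/a.txt": 250, "s3://example-bucket/b.txt": 100}
splits = split_inputs(files, max_split_bytes=100)
bins = pack_into_bins(splits, max_bin_bytes=200)
print(len(splits), "splits in", len(bins), "bins")
```

Shrinking the bin budget produces more, shorter invocations, which is the lever for staying under the execution time limit. It does not help when a single record needs longer than the limit on its own.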