Note: This blog updates the previous blog post entitled "Spark Only 19% Faster Than Hadoop?"
"The 19% improvement refers to only the improvement from moving compressed, serialized data from on-disk to in-memory. When you store data in-memory natively with Spark, Spark decompresses and deserializes the data into Java objects, resulting in a much larger improvement; this deserialized and decompressed format is usually what people refer to when they say "In-Memory Spark". (The improvement for the big data benchmark from on-disk Spark to in-memory Spark is quantified here: https://amplab.cs.berkeley.edu/benchmark/). So, you could say "Disk I/O is not the bottleneck for on-disk Spark" or "On-disk Spark could only improve by 19% from optimizing disk I/O". It would also be correct to say that our results imply that Spark using flash would only be at most 19% faster than on-disk Spark (because in that case, the data would still be serialized and compressed)."
So the motivation behind their work seems to be this: given that you already have a Spark cluster, what changes to hardware (disk and network) and to scheduling or task structure could you make to improve performance?
Spark is still fast -- many times faster than Hadoop. What surprised me (at least) in the work behind this paper is the reason it's fast: the data is already deserialized and decompressed. In iterative computations, Spark avoids the serialization/deserialization and compression/decompression round-trips that Hadoop goes through -- at least for data that doesn't go through shuffles. Spark shuffle data is of course serialized, and by default it is also compressed unless spark.shuffle.compress is set to false.
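To make the round-trip cost concrete, here is a rough sketch in plain Python. It is not Spark code: pickle and zlib merely stand in for a serializer and a compression codec, and the function names (make_records, iterate_over_objects, iterate_over_blob) are made up for this illustration. It runs the same toy "iterative" job twice, once over live objects and once over a compressed, serialized blob that has to be decoded on every pass.

```python
import pickle
import time
import zlib


def make_records(n):
    """Build (key, value) tuples standing in for one partition of a dataset."""
    return [(i % 100, float(i)) for i in range(n)]


def sum_values(records):
    """One 'iteration' of work: add up the value field of every record."""
    return sum(value for _, value in records)


def iterate_over_objects(records, iterations):
    """Data stays as live objects, so each pass reads it directly."""
    return [sum_values(records) for _ in range(iterations)]


def iterate_over_blob(blob, iterations):
    """Data is held compressed and serialized, so each pass decodes it first."""
    results = []
    for _ in range(iterations):
        records = pickle.loads(zlib.decompress(blob))
        results.append(sum_values(records))
    return results


def timed(fn, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    records = make_records(200_000)
    blob = zlib.compress(pickle.dumps(records))

    from_objects, t_objects = timed(iterate_over_objects, records, 5)
    from_blob, t_blob = timed(iterate_over_blob, blob, 5)

    # Both paths compute the same answers; only the decoding overhead differs.
    assert from_objects == from_blob
    print(f"live objects:         {t_objects:.3f}s")
    print(f"compressed + pickled: {t_blob:.3f}s")
```

The absolute timings depend on the machine and say nothing about Spark's actual numbers; the point is only that the second path pays a decode cost on every iteration while the first pays none.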
I apologize to Kay Ousterhout et al. for mischaracterizing their results.
See Research Paper "Making Sense of Performance in Data Analytics Frameworks" @ http://bit.ly/1x4uS6c