Yesterday at SC13, I attended at least two presentations about Hadoop and HPC. "Hadoop 2 and HPC: Beyond MapReduce," presented by Cray, Inc., further illustrated some of the miscommunication between the two camps. During Q&A, someone asked, "Have you measured against Spark? Hadoop is an entire ecosystem; you have to look at Spark and streaming technologies such as Storm, too, not just Hadoop itself." The Cray representative responded that they were just about to start looking at Spark. (For more on the Apache Spark project, see my 20-minute overview video.) Right away, it was clear that what a Hadoop person means by "Hadoop" (an entire ecosystem) and what an HPC person means by "Hadoop" (a specific release from Apache in isolation) are different.
After that, though, it started looking even worse for HPC, at least for Cray in particular. The very last question from the audience came from a woman whose tone implied she was surprised no one else had asked the question earlier, something to the effect of, "Isn't there a performance penalty when running Hadoop on a Cray HPC due to the data being centralized on a single server as opposed to the data being distributed amongst the nodes as in a conventional Hadoop cluster?" The response from the Cray representative was that the performance ended up being about the same.
Questions were over by then, and the next logical and obvious question went unasked, both out of politeness to the Cray representative and for lack of time: What is the performance per dollar of a Cray running Hadoop vs. a conventional Hadoop cluster on commodity hardware?
Now, to be fair, I'm sure the Cray representative was referring to a comparison on a "Big Data" problem. To digress into this important distinction: there are (at least) three broad categories of problems.
- Scientific simulation or processing. This is where conventional HPC is strong, because data is read at most once and sometimes not at all (e.g. for 3D movie rendering), and computational power is paramount.
- Big Data, where massive data from various sources is "dumped" onto a Hadoop cluster in the hopes that sometime in the future insights will be gleaned. In this second scenario, the data just sits on Hadoop, and gets processed and reprocessed at various times in the future.
- Streaming, which is the extension of batch-oriented Big Data to real-time. The new streaming technologies such as Storm, Spark Streaming, and S4 address this, and I'm not aware of any HPC vendor addressing this class of problem. Indeed, when that first questioner pressed the issue of streaming technology, the Cray representative did not have an answer.
HPC vendors are adopting Hadoop for two reasons:
- To address the classes of problems that their systems don't handle as well or as cost-efficiently.
- To leverage the comparative ease of programming Map/Reduce compared to OpenMP/MPI, the large community and body of knowledge surrounding Hadoop, as well as the popularity of Hadoop and Big Data. Indeed, one SC13 panelist in another session mentioned the difficulty of attracting students into HPC programs due to the popularity of Hadoop and Big Data.
To clarify my point about interconnects from my previous blog post, the interconnect speeds of HPC vs. Hadoop underscore the importance of topology. HPC often uses 40Gbps InfiniBand (which has the additional advantage of remote direct memory access (RDMA) to eliminate CPU involvement in communication), whereas Hadoop has conventionally used just 1Gbps Ethernet. For the class of Big Data problems, Hadoop achieves its performance even with the much slower interconnect. There is certainly nothing wrong with InfiniBand itself; the point is the opposite: the fact that such a powerful technology is needed in HPC illustrates the weakness of conventional HPC topology, at least for some classes of problems.
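To put rough numbers on the gap between those interconnects, here is a back-of-the-envelope sketch in Python. The 1 TB dataset size and the names `transfer_seconds` and `transfer_table` are assumptions made up for illustration, and the arithmetic ignores protocol overhead, latency, and contention, so real-world times would be longer.

```python
def transfer_seconds(dataset_bytes: float, link_gbps: float) -> float:
    """Idealized time to push dataset_bytes over a link rated at link_gbps gigabits/s."""
    bits = dataset_bytes * 8
    return bits / (link_gbps * 1e9)


# Hypothetical dataset size, chosen only to make the comparison concrete.
DATASET_BYTES = 1e12  # 1 TB

# Nominal link rates mentioned in this post.
LINKS_GBPS = {
    "1 Gbps Ethernet": 1.0,
    "10GbE": 10.0,
    "40 Gbps InfiniBand": 40.0,
}


def transfer_table(dataset_bytes: float = DATASET_BYTES) -> dict:
    """Map each link name to the idealized transfer time in seconds."""
    return {
        name: transfer_seconds(dataset_bytes, gbps)
        for name, gbps in LINKS_GBPS.items()
    }


if __name__ == "__main__":
    for name, secs in transfer_table().items():
        print(f"{name}: {secs:,.0f} s ({secs / 60:.1f} min)")
```

Under these idealized assumptions, shipping the whole terabyte takes about 8,000 seconds at 1Gbps versus about 200 seconds at 40Gbps. Hadoop sidesteps much of that cost by running the computation on the nodes that already hold the data, which is how it gets away with the slower network.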
But the set of Big Data problems that Hadoop is good at solving is expanding, thanks to projects like Apache Spark. Hadoop's disk-based implementation of Map/Reduce has conventionally been very poor at iterative algorithms such as those used in machine learning. That is where Apache Spark shines: instead of distributing data across the disks of a cluster as plain Hadoop does, it distributes data across the RAM of the machines in the cluster. With Apache Spark, 10GbE or faster becomes useful, since there is no more waiting for data to stream off disk. The combination of RAM-based mass data storage and higher-speed interconnects is bringing Hadoop into even more domains conventionally handled by HPC. A Hadoop cluster running Apache Spark over 10GbE where each node has a lot of RAM (say, 512GB today in 2013) starts to look like HPC, and certainly starts to solve some of the same problems.
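Why iterative algorithms suffer on a disk-based design can be shown with a toy model in plain Python. This is not Spark or Hadoop code: `DiskDataset`, `step`, `iterate_from_disk`, and `iterate_from_memory` are hypothetical names invented for this sketch, and the "disk" is just a list with a scan counter. The in-memory variant is roughly analogous to what caching a dataset in RAM buys you in Spark.

```python
class DiskDataset:
    """Stand-in for a dataset on disk; counts how many times it is fully scanned."""

    def __init__(self, values):
        self._values = list(values)
        self.scans = 0

    def read(self):
        self.scans += 1  # each call models one full pass over the disk
        return list(self._values)


def step(estimate, data):
    """One iteration of a toy algorithm: move the estimate halfway toward the mean."""
    mean = sum(data) / len(data)
    return estimate + 0.5 * (mean - estimate)


def iterate_from_disk(dataset, iterations):
    """Re-read the data on every pass, as a chain of disk-based jobs would."""
    estimate = 0.0
    for _ in range(iterations):
        estimate = step(estimate, dataset.read())
    return estimate


def iterate_from_memory(dataset, iterations):
    """Read the data once, keep it in RAM, and iterate over the cached copy."""
    cached = dataset.read()
    estimate = 0.0
    for _ in range(iterations):
        estimate = step(estimate, cached)
    return estimate


if __name__ == "__main__":
    on_disk = DiskDataset([1.0, 2.0, 3.0])
    in_ram = DiskDataset([1.0, 2.0, 3.0])
    print(iterate_from_disk(on_disk, 10), "scans:", on_disk.scans)
    print(iterate_from_memory(in_ram, 10), "scans:", in_ram.scans)
```

Both variants compute the same answer; the difference is that the first scans the dataset once per iteration while the second scans it once in total. Once the disk is no longer the bottleneck, the network between nodes is the next thing to matter.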
The overlap and "convergence" (a term the Cray representative devoted a slide to) of HPC and Hadoop is growing, due to the performance improvements, expanded domains, and software infrastructure (e.g. streaming technology) in Hadoop, and due to HPC vendors adopting Hadoop for the two reasons stated above. The two communities are working to find common ground.
Going forward, that common ground is coming in the form of GPUs. Both the HPC and Hadoop communities are adopting GPU technology and heterogeneous computing at a rapid pace, and hopefully as each community moves forward, they will be able to cross-pollinate architectures and understanding of problem domains. The Hadoop community has HPC-like problems, and the HPC community is having to deal with Big Data due to the explosion of data. While there are already many success stories of one or two racks of GPUs replacing room-sized HPC systems, the IBM/Nvidia engineering partnership promises to take this to the next level, given their stated goal of moving compute closer to the data.