One flaw of Hadoop MapReduce is high latency. Considering the growing volume, variety and velocity of data, organizations and data scientists require faster analytical platforms. Put simply, speed kills and Spark gains speed through caching and optimizing the master/node communications.
Spark is an open source cluster computing system that makes data analytics fast. To run programs faster, Spark provides primitives for in-memory cluster computing: your job can load data into memory and query it repeatedly much quicker than with disk-based systems like Hadoop MapReduce. Download for a test drive: http://spark-project.org/downloads
Spark is a high-speed cluster computing system compatible with Hadoop that can outperform it by up to 100 times considering its ability to perform computations in memory. It is a computation engine built on top of the Hadoop Distributed File System (HDFS) that efficiently support iterative processing (e.g., ML algorithms), and interactive queries. Spark provides an easy-to-program interface that is available in Java, Python, and Scala. Spark Streaming is a component of Spark that provides highly scalable, fault-tolerant streaming processing. With this functionality, Spark provides integrated support for all major computation models: batch, interactive, and streaming.
Shark is a large-scale data warehouse system that runs on top of Spark and is backward-compatible with Apache Hive, allowing users to run unmodified Hive queries on existing Hive workhouses. Shark is able to run Hive queries 100 times faster when the data fits in memory and up to 5-10 times faster when the data is stored on disk. Shark is a port of Apache Hive onto Spark that is compatible with existing Hive warehouses and queries. Shark can answer HiveQL queries up to 100 times faster than Hive without modification to the data and queries, and is also open source as part of BDAS.
The Shark data analysis warehouse system:
- builds on Spark (MapReduce deterministic, idempotent tasks)
- scales out and is fault-tolerant
- supports low-latency, interactive queries through in-memory computation
- supports both SQL and complex analytics such as machine learning
- is compatible with Apache Hive (storage, serdes, UDFs, types, metadata)
By using Spark as the execution engine and employing novel and traditional database techniques, Shark bridges the gap between MapReduce and MPP databases.
Take Shark for a drive on EC2 (takes 10 minutes to spin up a cluster): http://shark.cs.berkeley.edu
Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications such as Hadoop, MPI, Hypertable, and Spark. As a result, Mesos allows users to easily build complex pipelines involving algorithms implemented in various frameworks.