Data management in the Hadoop ecosystem is still in the early stages of development. The goal of cheaper and more effective ways of collecting, storing, processing and distributing structured and unstructured data (as well as internal and external data sources) has been impeded by complexity, lack of qualified professionals and difficulty in managing data.
Data movement and management in Hadoop are challenging. They include data motion, process orchestration, lifecycle management and data discovery. The trick to simplifying data management in Hadoop is to process data in a decentralized fashion and push the complexity into the platform, which frees data engineers to focus on the processing and business logic.
Apache Falcon is an open source data processing and management solution for the Hadoop ecosystem. It simplifies the management of data by enabling users to define infrastructure endpoints (e.g., clusters, HBase, databases, HCatalog), logical tables/feeds/datasets (e.g., location, permissions, source, retention limits, replication targets) and processing rules (e.g., inputs, outputs, schedule, business logic) as configurations.
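Apache Falcon addresses the challenges named above: data motion, process orchestration, lifecycle management and data discovery.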
Falcon allows users to on-board data sets with a complete understanding of how, when and where their data is managed across its lifecycle. It uses Apache Oozie for coordinating workflows. Workflow templates are used for data management. Falcon provides open APIs that enable those workflows to be orchestrated more broadly to allow integration between data warehouse systems (e.g., orchestrate data lifecycle workflows within Hadoop as well as with a Teradata system).
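To make the configuration-driven idea concrete, here is a minimal Python sketch of how endpoints, feeds and processing rules might be modeled as plain declarations, with lifecycle actions derived from them. This is not Falcon's API or its entity format; every class, field and function name below (Cluster, Feed, Process, expired_instances, replication_plan) and every URI and path is a hypothetical stand-in chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Tuple


@dataclass(frozen=True)
class Cluster:
    """An infrastructure endpoint (hypothetical fields, not Falcon's schema)."""
    name: str
    hdfs_uri: str


@dataclass(frozen=True)
class Feed:
    """A logical dataset: where it lives and how its lifecycle is managed."""
    name: str
    path_template: str                 # one instance per day, e.g. "/data/clicks/{day}"
    source: Cluster
    retention_days: int
    replication_targets: Tuple[Cluster, ...] = ()


@dataclass(frozen=True)
class Process:
    """A processing rule: inputs, outputs, schedule and business logic."""
    name: str
    inputs: Tuple[Feed, ...]
    outputs: Tuple[Feed, ...]
    frequency_hours: int
    workflow_path: str                 # hypothetical pointer to the business logic


def expired_instances(feed: Feed, available_days: List[date], today: date) -> List[str]:
    """Return paths of feed instances older than the feed's retention limit."""
    cutoff = today - timedelta(days=feed.retention_days)
    return [feed.path_template.format(day=d.isoformat())
            for d in sorted(available_days) if d < cutoff]


def replication_plan(feed: Feed, day: date) -> List[Tuple[str, str]]:
    """Return (source, target) URI pairs needed to replicate one feed instance."""
    path = feed.path_template.format(day=day.isoformat())
    return [(feed.source.hdfs_uri + path, target.hdfs_uri + path)
            for target in feed.replication_targets]


# Declarations only: the "platform" functions above decide what to do with them.
primary = Cluster("primary", "hdfs://primary:8020")
backup = Cluster("backup", "hdfs://backup:8020")
clicks = Feed("clicks", "/data/clicks/{day}", primary,
              retention_days=2, replication_targets=(backup,))
summary = Feed("click-summary", "/data/click-summary/{day}", primary, retention_days=30)
daily_rollup = Process("daily-rollup", inputs=(clicks,), outputs=(summary,),
                       frequency_hours=24, workflow_path="/apps/rollup/workflow.xml")

days = [date(2014, 1, 1), date(2014, 1, 2), date(2014, 1, 3), date(2014, 1, 4)]
print(expired_instances(clicks, days, today=date(2014, 1, 4)))
print(replication_plan(clicks, date(2014, 1, 3)))
```

The point of the sketch is the division of labor: the data engineer only writes the declarations at the bottom, while retention and replication decisions are computed by generic platform code.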
The Berkeley Data Analytics Stack (BDAS) is an open source, next-generation data analytics stack under development at the UC Berkeley AMPLab whose current components include Spark, Shark and Mesos.
One flaw of Hadoop MapReduce is high latency. Given the growing volume, variety and velocity of data, organizations and data scientists need faster analytical platforms. Put simply, Spark gains its speed by caching data and by optimizing communication between the master and the worker nodes.
Spark is an open source cluster computing system that makes data analytics fast. To run programs faster, Spark provides primitives for in-memory cluster computing: your job can load data into memory and query it repeatedly, much more quickly than with disk-based systems like Hadoop MapReduce. Download for a test drive: http://spark-project.org/downloads
Spark is a high-speed cluster computing system compatible with Hadoop that can outperform it by up to 100 times thanks to its ability to perform computations in memory. It is a computation engine built on top of the Hadoop Distributed File System (HDFS) that efficiently supports iterative processing (e.g., ML algorithms) and interactive queries. Spark provides an easy-to-program interface that is available in Java, Python, and Scala. Spark Streaming is a component of Spark that provides highly scalable, fault-tolerant stream processing. With this functionality, Spark provides integrated support for all major computation models: batch, interactive, and streaming.
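The "load data into memory and query it repeatedly" pattern can be illustrated with a small single-machine Python sketch. It deliberately does not use Spark's API: CachedDataset and read_log_lines are hypothetical names, and the in-line list stands in for a slow read from distributed storage. The sketch only shows why repeated queries get cheaper once the data is held in memory.

```python
from typing import Callable, Iterable, List, Optional


class CachedDataset:
    """Loads records once on first use, then serves every query from memory."""

    def __init__(self, loader: Callable[[], Iterable[str]]):
        self._loader = loader
        self._records: Optional[List[str]] = None
        self.loads = 0  # how many times the slow source was actually read

    def _data(self) -> List[str]:
        # Read from the source only if nothing is cached yet.
        if self._records is None:
            self._records = list(self._loader())
            self.loads += 1
        return self._records

    def count(self, predicate: Callable[[str], bool]) -> int:
        """Count the cached records that match the predicate."""
        return sum(1 for record in self._data() if predicate(record))


def read_log_lines() -> Iterable[str]:
    """Stand-in for a slow read from disk or a distributed file system."""
    return ["INFO start", "ERROR disk full", "WARN slow", "ERROR timeout"]


logs = CachedDataset(read_log_lines)
errors = logs.count(lambda line: line.startswith("ERROR"))    # triggers the load
warnings = logs.count(lambda line: line.startswith("WARN"))   # served from memory
print(errors, warnings, logs.loads)
```

A disk-based job would pay the read cost for each of the two queries; here the second query reuses the records already in memory, which is the effect that matters most for iterative algorithms and interactive exploration.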
Shark is a large-scale data warehouse system that runs on top of Spark. It is a port of Apache Hive onto Spark and is backward-compatible with Hive, allowing users to run unmodified Hive queries on existing Hive warehouses. Shark can answer HiveQL queries up to 100 times faster than Hive when the data fits in memory and up to 5-10 times faster when the data is stored on disk, without modification to the data or queries. It is also open source as part of BDAS.
The Shark data analysis warehouse system:
By using Spark as the execution engine and employing novel and traditional database techniques, Shark bridges the gap between MapReduce and MPP databases.
Take Shark for a drive on EC2 (takes 10 minutes to spin up a cluster): http://shark.cs.berkeley.edu
Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications such as Hadoop, MPI, Hypertable, and Spark. As a result, Mesos allows users to easily build complex pipelines involving algorithms implemented in various frameworks.