The Hadoop stack includes more than a dozen components, or subprojects, that are complex to deploy and manage. Installation, configuration, and production deployment at scale are all difficult.
The main components include:
- Hadoop. A Java software framework that supports data-intensive distributed applications
- ZooKeeper. A highly reliable distributed coordination system
- MapReduce. A flexible parallel data processing framework for large data sets
- HDFS. Hadoop Distributed File System
- Oozie. A workflow scheduler for managing Hadoop jobs
- HBase. A distributed, column-oriented database built on top of HDFS
- Hive. A high-level language built on top of MapReduce for analyzing large data sets
- Pig. Enables the analysis of large data sets using Pig Latin, a high-level language that is compiled into MapReduce jobs for parallel data processing.
The range of applications that use Hadoop shows the versatility of the MapReduce approach, and reviewing them reveals the typical characteristics of problems suited to it:
- Massive data volumes;
- Little or no dependence among data items;
- Both structured and unstructured data;
- Amenability to massive parallelism;
- Limited communication requirements.
Good examples that display some or all of these characteristics include:
• Applications that boil large volumes of data down into ordered or aggregated results – sorting, word and phrase counting, building inverted indices that map phrases to documents, and phrase searching across large document corpora.
• Batch analyses fast enough to satisfy the needs of operational and reporting applications, such as web traffic statistics or product recommendation analysis.
• Iterative analyses using data mining and machine learning algorithms, such as association rule analysis, k-means clustering, link analysis, classification, and Naïve Bayes analysis.
• Statistical analysis and reduction, such as web log analysis or data profiling.
• Behavioral analyses, such as clickstream analysis, discovery of content-distribution networks, and the viewing behavior of video audiences.
• Transformations and enhancements, such as auto-tagging social media content, ETL processing, and data standardization.
MapReduce is a programming model introduced and described by researchers at Google for parallel computation over large data sets distributed across clusters of many processors. In contrast to the explicitly parallel programming models typically used with imperative languages such as Java and C++, the MapReduce programming model is reminiscent of functional languages such as Lisp and APL in its reliance on two basic operational steps:
• Map, which describes the computation or analysis to be applied to a set of input key/value pairs to produce a set of intermediate key/value pairs; and
• Reduce, in which the sets of values associated with the intermediate keys output by the Map operation are combined to produce the results.
Conceptually, the computations applied during the Map phase to each input key/value pair are inherently independent, which means that both the data and the computations can be distributed across multiple storage and processing units and automatically parallelized.
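To make the shape of the model concrete, the sketch below expresses the two steps as Java interfaces. The names Pair, MapFunction, and ReduceFunction are illustrative only, not part of any Hadoop API; they simply mirror the map (k1, v1) → list(k2, v2) and reduce (k2, list(v2)) → list(v2) signatures described in the original Google paper.

```java
import java.util.List;

// Illustrative sketch only; these names are not part of any Hadoop API.

// A simple immutable key/value pair.
record Pair<K, V>(K key, V value) {}

// Map: transform one input key/value pair into a list of
// intermediate key/value pairs.
interface MapFunction<K1, V1, K2, V2> {
    List<Pair<K2, V2>> map(K1 key, V1 value);
}

// Reduce: combine all intermediate values that share a key into
// the result(s) for that key.
interface ReduceFunction<K2, V2, R> {
    R reduce(K2 key, List<V2> values);
}
```

Because every call to map is independent of every other, the runtime is free to execute the calls in parallel across the cluster; the only required coordination is grouping the intermediate pairs by key before the reduce step.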
A Common Example
The ability to scale through automatic parallelization can be demonstrated with a common MapReduce example that counts the number of occurrences of each word in a large collection of documents. Decomposing the problem reveals a hierarchy:
• The total number of occurrences of each word in the entire collection is equal to the sum of the occurrences of each word in each document;
• The total number of occurrences of each word in each document can be computed as the sum of the occurrences of each word in each paragraph;
• The total number of occurrences of each word in each paragraph can be computed as the sum of the occurrences of each word in each sentence.
This recursive decomposition provides the context for both the Map function, which has each processing node map each word to a count, and the Reduce function, which collects the word/count pairs and sums the counts for each distinct word, as the sketch below illustrates. The runtime system is responsible for distributing the input to the processing nodes, initiating the Map phase, coordinating the communication of the intermediate results, initiating the Reduce phase, and then collecting the final results.
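The word count can be written in a few dozen lines against the Hadoop Java MapReduce API; the sketch below follows the canonical example from the Hadoop documentation. The mapper emits a (word, 1) pair for every token it encounters, and the reducer sums the counts for each word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum all of the counts emitted for a given word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    // Running the reducer as a combiner pre-aggregates counts locally
    // on each node, cutting down the data shuffled across the network.
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Note that nothing in the mapper or reducer refers to documents, paragraphs, or sentences: the framework splits the input, runs the mapper on each split in parallel, groups the intermediate pairs by word, and hands each group to a reducer, exactly the division of labor described above.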
While we can speculate about the right level of granularity for the computation (document vs. paragraph vs. sentence), ultimately we can leave it to the runtime system to determine the best distribution of data and allocation of computation to minimize execution time. In fact, the value of a programming model such as MapReduce is that its simplicity allows the programmer to describe the expected results of each computational phase while relying on the compiler and runtime systems for optimal parallelization and fault tolerance.