Each cluster has one “master node” with multiple slave nodes. The master node runs NameNode and JobTracker functions and coordinates with the slave nodes to get the job done. The slaves run TaskTracker, HDFS to store data, and map and reduce functions for data computation.
The basic stack includes Hive and Pig for language and compilers, HBase for NoSQL database management, and Scribe and Flume for log collection. ZooKeeper provides centralized coordination for the stack.
If one of the data replicas on HDFS is corrupted, JobTracker, aware of where other replicas are located, can reschedule the task right where it resides, decreasing the need to move data back from one node to another. This saves network bandwidth and keeps performance and availability high. Once the job is mapped, the output is sorted and divided into several groups, which are distributed to reducers. Reducers may be located on the same node as the mappers or on another node.
cost of ownership can help you optimize resource utilization while minimizing operational costs.
Hadoop environment settings are a key factor in getting the full benefit from the rest of the hardware and software solutions.
Guidelines for Hardware Configurations
Determining the number, type, and configuration of the servers in the cluster is an important decision. The guidelines below recommend a typical configuration and focus mainly on the slave servers. Master servers should deploy additional RAM and secondary power supplies to ensure the highest performance and reliability of these critical machines. Because workloads may be bound by I/O, memory, or processor resources, system-level hardware also may need to be adjusted on a case-by-case basis.
Server platform. Dual-socket servers are optimal for Hadoop. They are generally more efficient, from a per-node, cost-benefit perspective, than large-scale multiprocessor platforms. A higher upfront hardware investment per node is offset by gains in efficiency for load balancing and parallelization overheads. Using the most current server platform will ensure the best energy efficiency and performance overall.
Choose systems that are optimized for airflow and use high-energy voltage regulators. Many vendors offer Intel Xeon–based processors optimized for the cost, density, weight, and power consumption characteristics required by high-density computing environments.
Hard drives. Typically four to six hard disk drives are the right number for a balanced Hadoop cluster, with 12 or more for I/Ointensive use cases. Each drive capacity should be 1 TB to 2 TB per drive. Using RAID on Hadoop servers is not recommended.
Hard drives should run in the Advanced Host Controller Interface (AHCI) mode with Native Command Queuing (NCQ) enabled to improve performance for multiple simultaneous read/write requests.
Memory. Memory must be sized and configured for efficient throughput of simultaneous parallel processing of large numbers of map and reduce tasks. Recommended RAM is 48 gigabytes (GB) to 64 GB.
Error correcting code (ECC) memory is also recommended to detect and correct storage and transmission errors.
Guidelines for Software Selection and Configuration
Selecting and configuring the operating system and application software have major implications for performance and stability.
Operating system. Use a Linux distribution based on kernel version 2.6.30 or later for deploying Hadoop because of built-in optimizations for energy and threading efficiencies. Even small power efficiencies multiplied over a large Hadoop cluster could add up to significant savings. To optimize Linux:
• Mount local file systems with the noatime attribute.
• Run Java* Virtual Machine 6u14 or later, which offers advantages such as compressed ordinary object pointers.
• Increase the default Linux open file descriptor limit from 1,024 to 64,000; increase the maximum number of processes to prevent OutOfMemoryErrors.
Hadoop Configuration Tuning
The benefits of the following recommendations depend on your specificHadoop environment. Get the best results by experimenting with the following configuration settings.
The main components include:
- Hadoop. Java software framework to support data-intensive distributed applications
- ZooKeeper. A highly reliable distributed coordination system
- MapReduce. A flexible parallel data processing framework for large data sets
- HDFS. Hadoop Distributed File System
- Oozie. A MapReduce job scheduler
- HBase. Key-value database
- Hive. A high-level language built on top of MapReduce for analyzing large data sets
- Pig. Enables the analysis of large data sets using Pig Latin. Pig Latinis a high-level language compiled into MapReduce for parallel data processing.
The range of applications that use Hadoop show the versatility of the MapReduce approach, and reviewing them provides some of the typical characteristics of problems suited to this approach:
- Massive data volumes,
- Little or no data dependence;
- Uses both structured and unstructured data;
- Amenable to massive parallelism;
- Requires limited communication.
Some good examples that display some or all of these characteristics include:
• Applications that boil lots of data down into ordered or aggregated results – sorting, word and phrase counts, building inverted indices mapping phrases to documents, phrase searching among large document corpuses.
• Batch analyses fast enough to satisfy the needs of operational and reporting applications, such as web traffic statistics or product recommendation analysis.
• Iterative analysis using data mining and machine learning algorithms, such as association rule analysis or k-means clustering, link analysis, classification, Naïve Bayes analysis.
• Statistical analysis and reduction, such as web log analysis, or data profiling
• Behavioral analyses such as click stream analysis, discovering content-distribution networks, viewing behavior of video audiences.
• Transformations and enhancements, such as auto-tagging social media, ETL processing, data standardization.