Hadoop, whose MapReduce model turns code into map and reduce jobs that Hadoop then runs, is great at crunching data but inefficient for analyzing data that keeps changing: each time you add, change or manipulate data, you must stream over the entire dataset again.
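
To make the map and reduce steps concrete, here is a minimal single-process sketch of a word count in Python. The function names (map_phase, shuffle, reduce_phase, word_count) are illustrative assumptions and not Hadoop's API; the point is only that every run reads the whole input again.

```python
from collections import defaultdict

def map_phase(documents):
    """Emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.lower().split():
            yield word, 1

def shuffle(pairs):
    """Group emitted values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}

def word_count(documents):
    # A batch job: the entire input is streamed on every run,
    # even if only one document changed since the last run.
    return reduce_phase(shuffle(map_phase(documents)))

if __name__ == "__main__":
    print(word_count(["big data", "big graphs"]))
```

Adding a third document to the list means calling word_count on all three again, which is the cost the paragraph above describes.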

In most organizations, data is constantly growing, changing, and being manipulated, so the time needed to analyze it increases significantly.

As a result, to process large and diverse data sets, run ad-hoc analytics, or work with graph data structures, there must be better alternatives to Hadoop / MapReduce.

Google (which devised MapReduce, the model that Hadoop implements as open source) thought so and architected a better, faster data-crunching ecosystem that includes Percolator, Dremel and Pregel. Google is one of the key innovators in large-scale architecture.
Percolator is a system for incrementally processing updates to a large data set. Replacing a batch-based indexing system with one based on incremental processing using Percolator significantly speeds up the process and reduces the time to analyze data.

Percolator’s architecture provides horizontal scalability and resilience. It reduces latency (the time between crawling a page and its availability in the index) by a factor of 100, and it simplifies the algorithm. The big advantage of Percolator is that indexing time is now proportional to the size of the page being indexed rather than to the size of the whole existing index.
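
The difference between batch and incremental indexing can be sketched in a few lines of Python. This is only an illustration under stated assumptions: build_index and IncrementalIndex are hypothetical names, the index is a toy in-memory inverted index, and nothing here models Percolator's actual machinery.

```python
from collections import defaultdict

def build_index(pages):
    """Batch approach: rebuild the inverted index from every page."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

class IncrementalIndex:
    """Keeps the index up to date by touching only the page that changed."""

    def __init__(self):
        self.pages = {}                 # url -> last indexed text
        self.index = defaultdict(set)   # word -> urls containing it

    def update(self, url, text):
        old_words = set(self.pages.get(url, "").lower().split())
        new_words = set(text.lower().split())
        # Work is proportional to the size of this page, not the whole index.
        for word in old_words - new_words:
            self.index[word].discard(url)
            if not self.index[word]:
                del self.index[word]
        for word in new_words - old_words:
            self.index[word].add(url)
        self.pages[url] = text
```

After any sequence of update calls, the incremental index should match what build_index produces from the same pages, but each update only visits the words of one page.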

See: http://research.google.com/pubs/pub36726.html
Dremel is for ad hoc analytics. It is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and a columnar data layout, it can run aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, which makes it about 100 times faster than MapReduce for such queries.

Dremel's architecture is similar to Pig and Hive. Yet while Hive and Pig rely on MapReduce for query execution, Dremel uses a query execution engine based on aggregator trees. 
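
As a rough sketch of the two ideas named above, the Python below stores each shard column by column and computes an average by pushing partial results up a tree. The names (leaf_aggregate, combine, tree_average) and the sample data are assumptions made for illustration; the sketch ignores nested data and everything else that makes the real system fast.

```python
def leaf_aggregate(shard, column):
    """A leaf reads only the requested column of its shard and returns a partial (sum, count)."""
    values = shard[column]
    return sum(values), len(values)

def combine(partials):
    """An intermediate node merges the partial results of its children."""
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return total, count

def tree_average(shards, column, fanout=2):
    """Compute AVG(column) by combining partial aggregates level by level."""
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    level = [leaf_aggregate(shard, column) for shard in shards]
    while len(level) > 1:
        level = [combine(level[i:i + fanout]) for i in range(0, len(level), fanout)]
    total, count = level[0]
    return total / count

# Hypothetical sample data: each shard is a dict of column name -> list of values.
shards = [
    {"latency_ms": [10, 20], "country": ["US", "DE"]},
    {"latency_ms": [30], "country": ["FR"]},
    {"latency_ms": [40, 50], "country": ["US", "JP"]},
]

if __name__ == "__main__":
    print(tree_average(shards, "latency_ms"))
```

Note that the country column is never read: with a columnar layout a query touches only the columns it names, and each tree level only passes small partial results upward.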

See: http://research.google.com/pubs/pub36632.html
Pregel is a system for large-scale graph processing and graph data analysis. Pregel is designed to execute graph algorithms quickly with simple code. It computes over large graphs much faster than alternatives, and the application programming interface is easy to use.

Pregel is architected for efficient, scalable and fault-tolerant implementation on clusters of thousands of commodity computers, and its implied synchronicity makes reasoning about programs easier. Distribution-related details are hidden behind an abstract API. The result is a framework for processing large graphs that is expressive and easy to program.
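
To give a feel for the programming model, here is a single-machine Python sketch in which every vertex learns the largest value in the graph through synchronous rounds of message passing. The function name pregel_max and the dict-based graph are assumptions for illustration, not Pregel's API, and the sketch leaves out distribution and fault tolerance entirely.

```python
def pregel_max(graph, values):
    """Propagate the maximum value to every vertex using synchronous supersteps.

    graph:  dict mapping vertex -> list of out-neighbours (every neighbour is also a key)
    values: dict mapping vertex -> initial numeric value
    """
    values = dict(values)
    # Superstep 0: every vertex sends its value along its out-edges.
    inbox = {v: [] for v in graph}
    for v, neighbours in graph.items():
        for n in neighbours:
            inbox[n].append(values[v])
    supersteps = 1

    # Later supersteps: a vertex does work only if it received messages.
    while any(inbox.values()):
        outbox = {v: [] for v in graph}
        for v, messages in inbox.items():
            if messages and max(messages) > values[v]:
                values[v] = max(messages)
                for n in graph[v]:
                    outbox[n].append(values[v])
            # Otherwise the vertex stays quiet, which plays the role of voting to halt.
        inbox = outbox  # Messages sent now are only visible in the next superstep.
        supersteps += 1
    return values, supersteps

if __name__ == "__main__":
    graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
    print(pregel_max(graph, {"a": 3, "b": 6, "c": 1}))
```

Because messages are delivered only at the boundary between rounds, each vertex's logic can be reasoned about in isolation. The computation stops when no vertex has anything left to send.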

See: http://kowshik.github.com/JPregel/pregel_paper.pdf
 


Comments

Rich Miller
07/18/2012 23:16

Percolator appears to be a transformative system.

Reply
07/18/2012 23:35

Incrementally processing updates to large data sets is indeed transformative.

Replacing a batch-based indexing system with an indexing system based on incremental processing speeds up the process significantly.

Percolator is a winner. Well done Google!

Reply
07/19/2012 09:23

Percolator, Dremel & Pregel are the next evolution for large data structure processing and analytics. They are better and faster than Hadoop. They may be cheaper.

Funny that just as Hadoop is creating a good ecosystem it may be replaced by a new technology.

Reply
07/19/2012 09:27

I don't think it replaces Hadoop - just an alternative.

Hadoop will make better sense for some firms.

Reply
07/19/2012 09:41

Percolator, Dremel & Pregel are way too much firepower for most organizations. They only make sense for huge - and I mean huge - data sets.

Reply
07/19/2012 13:45

You can say the same for Hadoop and many systems.

Yet what is overkill today may be right in the near future, and sooner than you think, considering the explosion in data growth and variety.

Reply
Michael Walker
07/19/2012 13:48

Google is now using Percolator, Dremel and Pregel.

Reply
07/19/2012 14:14

Percolator along with Dremel and Pregel is a more powerful and faster processing ecosystem than Hadoop's MapReduce ecosystem.

Hadoop still works well - but the issue of speed will matter more and more.

Reply
07/20/2012 07:34

There is a buzz about Hadoop's days being numbered. The king is dead, long live the king!

Percolator, Dremel & Pregel are simply an alternative to Hadoop - not a replacement.

Reply
07/20/2012 07:49

Apache Hama is an implementation of Pregel.

http://hama.apache.org/

Hama is a distributed computing framework based on BSP (Bulk Synchronous Parallel) computing techniques for massive scientific computations.

BSP abstract computer is a bridging model for designing parallel algorithms. A BSP computer consists of processors connected by a communication network. Each processor has a fast local memory, and may follow different threads of computation. A BSP computation proceeds in a series of global supersteps.

Barrier synchronisation is a serious issue. When a process reaches this point (the barrier), it waits until all other processes have finished their communication actions.

On most of today's architectures, barrier synchronization is often expensive, so it should be used sparingly. However, future architecture developments may make it much cheaper.
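
A superstep boundary can be sketched with Python's standard threading.Barrier. The function name bsp_total and the shared lists are assumptions chosen for illustration; a real BSP system exchanges messages over a network rather than through shared memory.

```python
import threading

def bsp_total(chunks):
    """Each worker sums its own chunk, waits at a barrier, then reads every peer's result."""
    n = len(chunks)
    barrier = threading.Barrier(n)
    partial = [0] * n   # one slot per worker, written in superstep 1
    totals = [0] * n    # each worker's view of the global total

    def worker(i):
        # Superstep 1: local computation, then publish the result.
        partial[i] = sum(chunks[i])
        # Barrier: block until every worker has published its partial sum.
        barrier.wait()
        # Superstep 2: it is now safe to read every peer's slot.
        totals[i] = sum(partial)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return totals

if __name__ == "__main__":
    print(bsp_total([[1, 2], [3], [4, 5, 6]]))
```

Every worker sits idle at barrier.wait() until the slowest one arrives, which is why the barrier is the expensive part of each superstep.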

Reply
07/20/2012 08:11

Apache Giraph is another implementation of Pregel.

http://giraph.apache.org/

Giraph follows the bulk-synchronous parallel model relative to graphs where vertices can send messages to other vertices during a given superstep. Checkpoints are initiated by the Giraph infrastructure at user-defined intervals and are used for automatic application restarts when any worker in the application fails.

Any worker in the application can act as the application coordinator and one will automatically take over if the current application coordinator fails.

Reply
07/20/2012 08:32

Check out GoldenOrb - a cloud-based open source project for large scale graph analysis modeled on Pregel architecture.

Saves time and lets you run traditional and sophisticated analytics on entire data sets rather than on samples, for better accuracy.

http://goldenorbos.org/

Reply
07/20/2012 09:06

GraphLab - http://graphlab.org/ - is a graph-based, high performance, distributed computation framework written in C++.

While GraphLab was originally developed for Machine Learning tasks, it has found great success at a broad range of other data-mining tasks, outperforming other abstractions by orders of magnitude.

GraphLab Features:

A unified multicore and distributed API: write once run efficiently in both shared and distributed memory systems

Tuned for performance: optimized C++ execution engine leverages extensive multi-threading and asynchronous IO

Scalable: GraphLab intelligently places data and computation using sophisticated new algorithms

HDFS Integration: Access your data directly from HDFS

Powerful Machine Learning Toolkits: Turn BigData into actionable knowledge with ease

Reply
07/20/2012 10:44

The techniques for predictive analysis and models demand speed.

Percolator, Dremel and Pregel got speed.

Reply
07/20/2012 11:00

Take a look at LevelDB - a fast and lightweight key/value database library.

It provides an ordered mapping from string keys to string values.

http://code.google.com/p/leveldb/

Reply
07/20/2012 11:25

Hang fire - just when Hadoop / MapReduce becomes the new hot thing, Google moves past it.

Reply
07/20/2012 11:38

Tenzing is good news to improve speed of Hadoop ecosystem. The biggest performance opportunities for Hive currently lie in making the execution engine more CPU efficient.

Tenzing is a SQL implementation on the MapReduce framework. When compared with other similar solutions (Sawzall, FlumeJava, Pig, Hive, HadoopDB), Tenzing’s advantage is low latency.

Tenzing is a query engine built on top of MapReduce for ad hoc analysis of data. Tenzing supports a mostly complete SQL implementation (with several extensions) combined with several key characteristics such as heterogeneity, high performance, scalability, reliability, metadata awareness, low latency, support for columnar storage and structured data, and easy extensibility.

Learn more: http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//pubs/archive/37200.pdf

http://www.vldb.org/pvldb/vol4/p1318-chattopadhyay.pdf

It appears they manipulated MapReduce to get better performance for SQL:

+Workerpools: These are essentially long running processes that do various parts of the MapReduce job (map task, reduce task, job coordinator, etc). Having a pool of running processes makes latencies lower than they would be if you had to launch a binary for each task in the job. This is certainly the case with JVM launches in Hadoop. Hadoop gets part of this done with reusable JVMs. The tradeoff, of course, is that fault isolation becomes a messier proposition.

+Streaming and In-Memory Chaining: Allows two MapReduce jobs to communicate without writing temporary data to disk (GFS). I wonder if this can be done easily with just some InputFormat/OutputFormat magic ... I suspect this is doable with some thought. Memory-chaining allows a mapper and a reducer to be co-located in the same process. This is probably going to be a bit harder to do in Hadoop.

+Sort Avoidance: This feature allowed you to tell MapReduce to shuffle, but not sort. I've seen the need for this in many applications. Again, makes perfect sense for Hadoop also.

+Block Shuffle: For smaller rows, when sorting is not needed, the block shuffle reduces overheads in the shuffle phase. This is a performance opportunity opened up by sort avoidance.
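
The sort-avoidance idea in the list above can be illustrated in Python by grouping the same key/value pairs two ways. Both function names are hypothetical and this is a single-process sketch, not how Tenzing or Hadoop implement their shuffle.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

def shuffle_with_sort(pairs):
    """Classic shuffle: sort by key, then group adjacent equal keys."""
    ordered = sorted(pairs, key=itemgetter(0))
    return {key: [value for _, value in group]
            for key, group in groupby(ordered, key=itemgetter(0))}

def shuffle_without_sort(pairs):
    """Sort avoidance: drop each pair into its key's bucket; no key ordering is produced."""
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[key].append(value)
    return dict(buckets)

if __name__ == "__main__":
    pairs = [("us", 1), ("de", 2), ("us", 3)]
    print(shuffle_with_sort(pairs))
    print(shuffle_without_sort(pairs))
```

Both versions hand each reducer the same values per key; the second simply skips the sort, which is all an aggregation like a per-key sum or count needs.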

Reply

Tenzing’s advantage is low latency and thus a big performance improvement and welcome addition to the Hadoop ecosystem.

Tenzing is currently used internally at Google by 1000+ employees and serves 10000+ queries per day over 1.5 petabytes of compressed data.

http://research.google.com/pubs/pub37200.html

Tenzing (SQL on MapReduce) was built because the data-warehouse was getting too expensive and was not able to keep pace with the demands the users made - in terms of complex queries (SQL was always getting in the way), and performance (the ETL process was often a big bottleneck).

http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//pubs/archive/37200.pdf

Reply
07/20/2012 13:43

NuoDB is a NewSQL cloud database. It looks and behaves like a traditional SQL database from the outside but under the covers it's a revolutionary database solution. It is a new class of database for a new class of datacenter.

NuoDB is the first database that is 100% SQL compliant, guarantees ACID transactions, and scales out elastically on cloud-based computing resources.

Download Beta 7: http://www.nuodb.com/

Highlights include:

Performance and scalability — Beta 7 now supports up to 50 nodes enabling the database to elastically scale to tens of thousands of transactions per second on Amazon EC2 or local commodity servers.

Product hardening — Beta 7 provides additional hardening and increased stability.

SQL — Beta 7 now supports SQL 92 with 99 extensions as well as incorporating several querying and indexing enhancements and enhances SQL Standards compliance to cover a majority of application requirements.

NuoDB Administrative Console — Functional enhancements to graphical console, which was introduced in Beta 6.

New language drivers — Incorporates new Ruby, PHP/PDO, and Perl drivers to address growing Web development needs. These drivers were developed by the NuoDB Github community.

Reply


