Oracle is uniquely positioned to combine everything needed to meet the big data challenge, including software and hardware, into one engineered system. The Oracle Big Data Appliance is an engineered system that combines optimized hardware with a comprehensive software stack, featuring specialized solutions developed by Oracle, to deliver a complete, easy-to-deploy solution for acquiring, organizing and loading big data into Oracle Database 11g. It is designed to deliver extreme analytics on all data types, with enterprise-class performance, availability,
supportability and security. With Big Data Connectors, the solution is tightly integrated with Oracle Exadata and Oracle Database, so you can analyze all your data together with extreme performance.
Once data has been loaded from Oracle Big Data Appliance into Oracle Database or Oracle Exadata, end users can use one of the following easy-to-use tools for in-database, advanced analytics:
Oracle R Enterprise – Oracle’s version of the widely used open source R statistical environment enables statisticians to use R on very large data sets without any change to the end-user experience. Examples of R usage include predicting airline delays at a particular airport and submitting clinical trial analyses and results.
In-Database Data Mining – the ability to create complex models and deploy them on very large data volumes to drive predictive analytics. End users can leverage the results of these predictive models in their BI tools without needing to know how to build the models. For example, regression models can be used to predict customer age based on
purchasing behavior and demographic data.
In-Database Text Mining – the ability to mine text from micro blogs, CRM system comment fields and review sites combining Oracle Text and Oracle Data Mining. An example of text mining is sentiment analysis based on comments. Sentiment analysis tries to show how customers feel about certain companies, products or activities.
In-Database Semantic Analysis – the ability to create graphs and connections between various data points and data sets. Semantic analysis can, for example, build networks of relationships that determine the value of a customer’s circle of friends. When looking at customer churn, a customer’s value is then based on the value of his or her network rather than on
the value of the individual customer alone.
In-Database Spatial – the ability to add a spatial dimension to data and show data plotted on a map. This enables end users to understand geospatial relationships and trends much more efficiently. For example, spatial data can visualize a network of people and their geographical proximity. Customers who are in close proximity can
readily influence each other’s purchasing behavior, an opportunity that is easily missed if spatial visualization is left out.
In-Database MapReduce – the ability to write procedural logic and seamlessly leverage Oracle Database parallel execution. In-database MapReduce allows data scientists to create high-performance routines with complex logic.
In-database MapReduce can be exposed via SQL. Examples of leveraging in-database MapReduce are sessionization of weblogs or organization of Call Detail Records (CDRs).
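As an illustration of the sessionization idea only, and not of Oracle's in-database MapReduce API, the following Python sketch keys weblog hits by user in a map step and splits each user's clickstream into sessions in a reduce step; the 30-minute inactivity timeout and the sample records are assumptions.

from collections import defaultdict

SESSION_TIMEOUT = 30 * 60   # assumed 30-minute inactivity gap, in seconds

# Hypothetical weblog records: (user_id, epoch_timestamp, url)
weblog = [
    ("u1", 1000, "/home"), ("u1", 1200, "/cart"),
    ("u1", 9000, "/home"), ("u2", 1100, "/search"),
]

def map_phase(records):
    # Map step: key each hit by user id.
    by_user = defaultdict(list)
    for user, ts, url in records:
        by_user[user].append((ts, url))
    return by_user

def reduce_phase(by_user):
    # Reduce step: split each user's ordered clickstream into sessions.
    sessions = []
    for user, hits in by_user.items():
        hits.sort()
        current = [hits[0]]
        for prev, nxt in zip(hits, hits[1:]):
            if nxt[0] - prev[0] > SESSION_TIMEOUT:
                sessions.append((user, current))
                current = []
            current.append(nxt)
        sessions.append((user, current))
    return sessions

for user, session in reduce_phase(map_phase(weblog)):
    print(user, [url for _, url in session])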
IBM’s integrated big data platform has four core capabilities: Hadoop-based analytics, stream computing, data warehousing, and information integration and governance.
Hadoop (with MapReduce, where code is turned into map and reduce jobs that Hadoop then runs) is great at crunching data yet inefficient for analyzing it, because each time you add, change or manipulate data you must stream over the entire dataset.
In most organizations data is constantly growing, changing and being manipulated, so the time needed to analyze it increases significantly.
As a result, to process large and diverse data sets, ad-hoc analytics or graph data structures, better alternatives to Hadoop / MapReduce are needed.
Google, whose MapReduce and GFS papers provided the blueprint for Hadoop, thought so and architected a better, faster data-crunching ecosystem that includes Percolator, Dremel and Pregel. Google remains one of the key innovators in large-scale architecture.
Percolator is a system for incrementally processing updates to large data sets. By replacing a batch-based indexing system with one based on incremental processing using Percolator, Google significantly sped up indexing and reduced the time to analyze data.
Percolator’s architecture provides horizontal scalability and resilience. Percolator reduces latency (the time between a page being crawled and becoming available in the index) by a factor of 100 and simplifies the algorithm. Its big advantage is that indexing time is now proportional to the size of the page being indexed rather than to the size of the whole existing index.
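As a rough, hypothetical Python sketch of why that matters (not Percolator's actual API, which is built on distributed transactions and notifications over Bigtable): a batch indexer rebuilds the inverted index from the entire corpus on every change, while an incremental indexer touches only the postings of the changed page.

from collections import defaultdict

def build_index(corpus):
    # Batch approach: rebuild the whole inverted index (cost grows with the corpus).
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for word in text.split():
            index[word].add(doc_id)
    return index

def update_index(index, corpus, doc_id, new_text):
    # Incremental approach: retract and re-add only the changed page's words
    # (cost grows with the page, not with the existing index).
    for word in corpus.get(doc_id, "").split():
        index[word].discard(doc_id)
    for word in new_text.split():
        index[word].add(doc_id)
    corpus[doc_id] = new_text

corpus = {"p1": "big data systems", "p2": "graph data"}
index = build_index(corpus)                                 # full rebuild, done once
update_index(index, corpus, "p2", "columnar data layout")   # per-page update
print(sorted(index["data"]))                                # -> ['p1', 'p2']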
Dremel is for ad hoc analytics: a scalable, interactive ad-hoc query system for the analysis of read-only nested data. By combining multi-level execution trees with a columnar data layout, it can run aggregation queries over trillion-row tables in seconds, roughly 100 times faster than MapReduce, and it scales to thousands of CPUs and petabytes of data.
Dremel's architecture is similar in purpose to Pig and Hive, yet while Hive and Pig rely on MapReduce for query execution, Dremel uses a query execution engine based on aggregator trees.
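A toy Python sketch of the columnar idea Dremel exploits (an illustration of column storage in general, not of Dremel's execution engine or its nested-record encoding): when each field is stored as its own array, an aggregation query reads only the columns it touches instead of whole rows.

from collections import defaultdict

# Row layout: every query must walk entire records.
rows = [
    {"country": "US", "bytes": 120, "url": "/a"},
    {"country": "DE", "bytes": 300, "url": "/b"},
    {"country": "US", "bytes": 80,  "url": "/c"},
]

# Columnar layout: each field is stored contiguously and scanned independently.
columns = {
    "country": ["US", "DE", "US"],
    "bytes":   [120, 300, 80],
    "url":     ["/a", "/b", "/c"],
}

# SELECT country, SUM(bytes) GROUP BY country -- only two columns are read.
totals = defaultdict(int)
for country, b in zip(columns["country"], columns["bytes"]):
    totals[country] += b
print(dict(totals))   # {'US': 200, 'DE': 300}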
Pregel is a system for large-scale graph processing and graph data analysis. Pregel is designed to execute graph algorithms faster and use simple code. It computes over large graphs much faster than alternatives, and the application programming interface is easy to use.
Pregel is architected for efficient, scalable and fault-tolerant implementation on clusters of thousands of commodity computers, and its implied synchronicity makes reasoning about programs easier. Distribution-related details are hidden behind an abstract API. The result is a framework for processing large graphs that is expressive and easy to program.
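To make the vertex-centric model concrete, here is a minimal single-machine Python sketch in the spirit of Pregel's supersteps; the toy graph, the max-value propagation task and the halting rule are assumptions chosen only to illustrate the message-passing pattern, and real Pregel distributes the vertices across a cluster.

# Toy undirected graph: vertex -> neighbors, plus an initial value per vertex.
edges = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
values = {"a": 3, "b": 6, "c": 2}

# Superstep 0: every vertex "receives" its own value so that it sends at least once.
messages = {v: [values[v]] for v in edges}
active = set(edges)
superstep = 0

while active:
    next_messages = {v: [] for v in edges}
    next_active = set()
    for vertex in edges:
        inbox = messages[vertex]
        if not inbox:
            continue                      # vertex stays halted when it has no mail
        new_value = max(inbox + [values[vertex]])
        if new_value > values[vertex] or superstep == 0:
            values[vertex] = new_value    # update local state
            for neighbor in edges[vertex]:
                next_messages[neighbor].append(new_value)   # message neighbors
            next_active.add(vertex)
    messages, active = next_messages, next_active
    superstep += 1

print(values)   # every vertex converges to the global maximum: {'a': 6, 'b': 6, 'c': 6}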
The main components of Hadoop are described in the Hadoop documentation: http://bit.ly/LqkJTP
The goal is to design and build a data warehouse / business intelligence (BI) architecture that provides a flexible, multi-faceted analytical ecosystem for each unique organization.
In a traditional BI architecture, all analytical processing first passes through a data warehouse.
In the new, modern BI architecture, data reaches users through a multiplicity of organizational data structures, each tailored to the type of content it contains and the type of user who wants to consume it.
The data revolution (big and small data sets) provides significant improvements. New tools like Hadoop allow organizations to cost-effectively consume and analyze large volumes of semi-structured data. In addition, the new architecture complements traditional top-down data delivery methods with more flexible, bottom-up approaches that promote predictive or exploratory analytics and rapid application development.
In the above diagram, the objects in blue represent traditional data architecture. Objects in pink represent the new modern BI architecture, which includes Hadoop, NoSQL databases, high-performance analytical engines (e.g. analytical appliances, MPP databases, in-memory databases), and interactive, in-memory visualization tools.
Most source data now flows through Hadoop, which primarily acts as a staging area and online archive. This is especially true for semi-structured data, such as log files and machine-generated data, but also for some structured data that cannot be cost-effectively stored and processed in SQL engines (e.g. call center records).
From Hadoop, data is fed into a data warehousing hub, which often distributes data to downstream systems, such as data marts, operational data stores, and analytical sandboxes of various types, where users can query the data using familiar SQL-based reporting and analysis tools.
Today, data scientists analyze raw data inside Hadoop by writing MapReduce programs in Java and other languages. In the future, users will be able to query and process Hadoop data using familiar SQL-based data integration and query tools.
The modern BI architecture can analyze large volumes and new sources of data and is a significantly better platform for data alignment, consistency and flexible predictive analytics.
Thus, the new BI architecture provides a modern analytical ecosystem featuring both top-down and bottom-up data flows that meet all requirements for reporting and analysis.
In the top-down world, source data is processed, refined, and stamped with a predefined data structure, typically a dimensional model, and then consumed by casual users using SQL-based reporting and analysis tools. In this domain, IT developers create data and semantic models so business users can get answers to known questions and executives can track performance of predefined metrics. Here, design precedes access. The top-down world also takes great pains to align data along conformed dimensions and deliver clean, accurate data. The goal is to deliver a consistent view of the business entities so users can spend their time making decisions instead of arguing about the origins and validity of data artifacts.
Creating a uniform view of the business from heterogeneous sets of data is not easy. It takes time, money, and patience, often more than most departmental heads and business analysts are willing to tolerate. They often abandon the top-down world for the underworld of spreadmarts and data shadow systems. Using whatever tools are readily available and cheap, these data-hungry users create their own views of the business. Eventually, they spend more time collecting and integrating data than analyzing it, undermining both their productivity and a consistent view of business information.
The bottom-up world works differently. The modern BI architecture creates an analytical ecosystem that brings prodigal data users back into the fold. It allows an organization to perform true ad hoc exploration (predictive or exploratory analytics) and promotes the rapid development of analytical applications using in-memory departmental tools. In a bottom-up environment, users can't anticipate the questions they will ask on a daily or weekly basis or the data they'll need to answer those questions. Often, the data they need doesn't yet exist in the data warehouse.
The modern BI architecture creates analytical sandboxes that let power users explore corporate and local data on their own terms. These sandboxes include Hadoop, virtual partitions inside a data warehouse, and specialized analytical databases that offload data or analytical processing from the data warehouse or handle new untapped sources of data, such as Web logs or machine data. The new environment also gives department heads the ability to create and consume dashboards built with in-memory visualization tools that point both to a corporate data warehouse and other independent sources.
Combining top-down and bottom-up worlds is challenging but doable with determined commitment.
BI professionals need to guard data semantics while opening access to data.
Business users need to commit to adhering to data standards.
Further, well-designed data governance programs are an absolute requirement.
Gartner's Yvonne Genovese reviews the popular term "Big Data" and why IT Leaders should act now.
Pattern-Based Strategy: Getting Value from Big Data
"Big data" refers to the growth in the volume of data in organizations. Understanding how to use Pattern-Based Strategy to seek, model and adapt to patterns contained in big data will be a critical IT and business skill.
Ian Ayres of Yale University Law School talks about the ideas in his book, Super Crunchers: Why Thinking-by-Numbers Is the New Way to Be Smart. Ayres argues for the power of data and analysis over more traditional decision-making methods using judgment and intuition. He talks with EconTalk host Russ Roberts about predicting the quality of wine based on climate and rainfall, the increasing use of randomized data in the world of business, the use of evidence and information in medicine rather than the judgment of your doctor, and whether concealed handguns or car protection devices such as LoJack reduce the crime rate.
Hadoop was developed to enable applications to work with thousands of computationally independent computers and petabytes of data. Hadoop is a popular open source project that not only incorporates an implementation of the MapReduce programming model, but also includes other subprojects supporting reliable and scalable distributed computing, such as HDFS (a distributed file system) and Pig (a high-level data flow language for parallel computing), along with others. See http://hadoop.apache.org.
The Hadoop stack includes more than a dozen components, or subprojects, that are complex to deploy and manage. Installation, configuration and production deployment at scale are genuinely difficult.
The main components include HDFS, MapReduce, and higher-level subprojects such as Pig and Hive.
The range of applications that use Hadoop shows the versatility of the MapReduce approach, and reviewing them reveals some of the typical characteristics of problems suited to this approach.
Some good examples that display some or all of these characteristics include:
• Applications that boil lots of data down into ordered or aggregated results – sorting, word and phrase counts, building inverted indices mapping phrases to documents, phrase searching among large document corpuses.
• Batch analyses fast enough to satisfy the needs of operational and reporting applications, such as web traffic statistics or product recommendation analysis.
• Iterative analysis using data mining and machine learning algorithms, such as association rule analysis or k-means clustering, link analysis, classification, Naïve Bayes analysis.
• Statistical analysis and reduction, such as web log analysis or data profiling.
• Behavioral analyses such as click stream analysis, discovering content-distribution networks, viewing behavior of video audiences.
• Transformations and enhancements, such as auto-tagging social media, ETL processing, data standardization.
MapReduce is a programming model introduced and described by researchers at Google for parallel computation involving large data sets that are distributed across clusters of many processors. In contrast to the explicitly parallel programming models typically used with imperative languages such as Java and C++, the MapReduce programming model is reminiscent of functional languages such as Lisp and APL, in its reliance on two basic operational steps:
• Map, which describes the computation or analysis to be applied to a set of input key/value pairs to produce a set of intermediate key/value pairs, and
• Reduce, in which the set of values associated with the intermediate key/value pairs output by the Map operation are combined to provide the results.
Conceptually, the computations applied during the Map phase to each input key/value pair are inherently independent, which means that both the data and the computations can be distributed across multiple storage and processing units and automatically parallelized.
A Common Example
The ability to scale based on automatic parallelization can be demonstrated using a common MapReduce example that counts the number of occurrences of each word in a collection of many documents. Looking at the problem provides a hierarchical view:
• The total number of occurrences of each word in the entire collection is equal to the sum of the occurrences of each word in each document;
• The total number of occurrences of each word in each document can be computed as the sum of the occurrences of each word in each paragraph;
• The total number of occurrences of each word in each paragraph can be computed as the sum of the occurrences of each word in each sentence.
This apparent recursion provides the context for both our Map function, which instructs each processing node to map each word to its count, and the Reduce function, which collects the word count pairs and sums together the counts for each particular word. The runtime system is responsible for distributing the input to the processing nodes, initiating the Map phase, coordinating the communication of the intermediate results, initiating the Reduce phase, and then collecting the final results.
While we can speculate on the level of granularity for computation (document vs. paragraph vs. sentence), ultimately we can leave it to the runtime system to determine the best distribution of data and allocation of computation to reduce the execution time. In fact, the value of a programming model such as MapReduce is that its simplicity essentially allows the programmer to describe the expected results of each of the computational phases while relying on the compiler and runtime systems for optimal parallelization and fault tolerance.
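As a minimal, single-machine Python sketch of the word-count example (the sample corpus, the function names and the explicit shuffle step are assumptions made for illustration; in Hadoop the framework performs the shuffle between the mapper and the reducer):

from collections import defaultdict

documents = ["the quick brown fox", "the lazy dog", "the quick dog"]

def map_phase(doc):
    # Map: emit an intermediate (word, 1) pair for every word in one document.
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    # Shuffle: group intermediate values by key (handled by the framework in Hadoop).
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped

def reduce_phase(word, counts):
    # Reduce: sum the counts associated with one word.
    return word, sum(counts)

intermediate = [pair for doc in documents for pair in map_phase(doc)]
results = dict(reduce_phase(w, c) for w, c in shuffle(intermediate).items())
print(results)   # {'the': 3, 'quick': 2, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 2}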