The emerging "Data Stack" or "Data Layer" is in full transition and can be viewed and defined in many different ways. The ability to capture, analyze, and learn from data generated at unprecedented scale, combined with the means to access that information on demand, when relevant, creates business opportunities we are only beginning to appreciate.
One simple view defines data as a three-layer stack:
The top layer of the stack, internal data, is specific to an organization. The contextual layer comes from outside sources. The integrated data model serves advanced data analytics applications.
Another, more complex view is represented in the image above:
At the foundational layer of the big data stack, speed is the key differentiator, along with scalable persistence and compute power. At the middle layer of the big data stack is analytics, where features are extracted from data and fed into classification and prediction algorithms. At the top of the stack are services and applications. This is the level at which consumers experience a data product, whether it be a music recommendation or a traffic route prediction.
At the bottom tier, the data layer, free tools are shown in red (MySQL, Postgres, Hadoop), and their commercial adaptations (InfoBright, Greenplum, MapR) compete principally along the axis of speed, offering faster processing and query times. Several of these players are pushing up toward the second tier of the data stack, analytics. At this layer, the primary competitive axis is scale: few offerings can address terabyte-scale data sets, and those that do are typically proprietary. Finally, at the top layer of the big data stack lie the services that touch consumers and businesses. Here, focus on a specific sector, combined with depth that reaches down into the analytics tier, is the defining competitive advantage.
There are three data layer trends: data growth, web application user growth and the explosion of mobile computing.
Data growth [Big Data]. IDC estimates that an organization's data will double every two years. Mining this raw data for valuable, actionable insights is challenging. Hadoop and its ecosystem (HDFS, MapReduce, Cassandra, Hive) are batch-processing oriented and help in analyzing large data sets.
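To make the batch-processing model concrete, here is a minimal sketch of the MapReduce pattern in plain Python: a map step emits key-value pairs, a shuffle groups them by key, and a reduce step aggregates each group. The word-count task and all names here are illustrative, not an actual Hadoop job.

    from collections import defaultdict

    def map_phase(documents):
        # map: emit a (word, 1) pair for every word in every document
        for doc in documents:
            for word in doc.split():
                yield word.lower(), 1

    def reduce_phase(pairs):
        # shuffle: group emitted values by key
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        # reduce: aggregate each group
        return {key: sum(values) for key, values in groups.items()}

    docs = ["big data stack", "data layer data stack"]
    print(reduce_phase(map_phase(docs)))  # {'big': 1, 'data': 3, 'stack': 2, 'layer': 1}

A real Hadoop job expresses the same two functions against the Hadoop APIs, and the framework streams the entire dataset through them on every run, which is exactly the batch characteristic noted above.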
User growth [NoSQL]. Most new interactive software systems are accessed via a browser. If available on the public Internet, these applications now have 2 billion potential users and a 24x7 uptime requirement. Regardless of dataset size, these software systems put unprecedented pressure on the data layer: massive user concurrency; the need for predictable, low-latency random access to data to maintain a snappy interactive user experience; and the need for continuous operations, even during database maintenance. Couchbase and MongoDB are open source NoSQL technologies that meet the data management needs of interactive web applications.
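The access pattern these applications need is the keyed point read and write, not the scan. A minimal sketch using MongoDB's Python driver, assuming a locally running MongoDB instance; the database, collection, and field names are illustrative:

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    profiles = client["appdb"]["user_profiles"]

    # Point writes and reads are keyed lookups, so latency stays
    # predictable as the user base and the dataset grow.
    profiles.update_one(
        {"_id": "user42"},
        {"$set": {"theme": "dark", "last_login": "2012-06-01"}},
        upsert=True,
    )
    print(profiles.find_one({"_id": "user42"}))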
Mobile computing growth [Mobile Sync]. Mobile devices are increasingly where we create and consume information, but data aggregation and processing will be accomplished in the cloud. IDC estimates that in 2015, 1.4 zettabytes of the 4.9 zettabytes created that year will be "touched by the cloud." Delivering the right data to millions of mobile devices, when and where it is needed (and then getting it back again), is the mobile-cloud data sync challenge.
Together, these three trends may constitute the emerging modern data stack, one that supports the ebb and flow of information between web and mobile applications and the cloud.
The key is to design and build a data warehouse / business intelligence (BI) architecture that provides a flexible, multi-faceted analytical ecosystem, optimized for efficient ingestion and analysis of large and diverse datasets.
Data comes from a variety of sources (internal, external, contextual, integrated): data created directly by users of web and mobile applications, observations and metadata related to the use of those applications, external data feeds, and intermediate analysis results. Processing this information produces the data needed by user-facing applications, which is fed into a NoSQL solution.
The NoSQL solution provides low-latency, random access to the data, meeting the needs of web applications. It also gives a mobile synchronization server quick, random access to the data mobile users need.
A Mobile Sync Server manages transient connections with mobile devices, delivering data to native mobile applications when and where it is needed, and receiving information in return.
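A toy sketch of the sync idea: the device tracks a revision checkpoint from its last sync, and the server returns only the documents changed since that checkpoint. The store and the revision scheme here are hypothetical, not any vendor's protocol.

    server_store = {}      # doc_id -> (revision, body)
    server_revision = 0

    def server_put(doc_id, body):
        # every write bumps a global revision counter
        global server_revision
        server_revision += 1
        server_store[doc_id] = (server_revision, body)

    def sync_pull(since_revision):
        # return only docs changed since the device's checkpoint
        changes = {doc_id: body
                   for doc_id, (rev, body) in server_store.items()
                   if rev > since_revision}
        return changes, server_revision

    server_put("note1", "buy milk")
    server_put("note2", "call bob")
    changes, checkpoint = sync_pull(0)            # first sync: everything
    server_put("note1", "buy milk and eggs")
    changes, checkpoint = sync_pull(checkpoint)   # only the changed doc
    print(changes)                                # {'note1': 'buy milk and eggs'}

A production sync server adds conflict detection, per-user filtering, and push notification of changes, but the checkpoint exchange above is the core of the pattern.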
Greenplum is a big data analytics company whose products harness the skills of data science teams to help global organizations realize the full promise of business agility and become data-driven, predictive enterprises.
Greenplum Unified Analytics Platform (UAP) combines the co-processing of structured and unstructured data with a productivity engine that enables collaboration among your data science team. Greenplum UAP includes Greenplum Database, Greenplum HD, and Greenplum Chorus.
The SAP real-time data platform, based on the flagship SAP HANA platform, includes data management capabilities from SAP Sybase IQ, SAP Sybase ESP, SAP Sybase ASE, and SAP Enterprise Information Management. It unlocks business value from "big data" by providing real-time decision support within the window of opportunity, with the capability to ingest, store, and process big data in real time.
Cut through the clutter of unwanted data with the SAP HANA database. This in-memory database can help your applications zero in on the information they need, without wasting time sifting through irrelevant data. The result? Instant answers to your complex queries, and better decision making across your enterprise.
Empower your business users with anytime, anywhere access to key insights delivered in context. Get ready to increase responsiveness, reduce IT costs and workload, and drive better decision making across your organization.
Get a holistic, real-time view of your business and drill down into the information you need to make critical decisions faster. Apply analytics to business scenarios to help you predict trends, reduce costs, maximize efficiency, and uncover solutions to your pressing industry and line-of-business challenges, for a significant competitive edge. Choose from purpose-built, role-specific solutions together with rapid deployment options.
SAP Data Services and SAP Information Steward are intended to provide both business users and IT with an intuitive and comprehensive information management solution, with further enhancements planned.
SAP HANA Database implements in-memory database technology. There are four components within the software group:
SAP HANA DB (or HANA DB) refers to the database technology itself.
SAP HANA Studio refers to the suite of modeling tools provided by SAP.
SAP HANA Appliance refers to HANA DB as delivered on partner-certified hardware as an appliance. It also includes the modeling tools from HANA Studio, as well as replication and data transformation tools to move data into HANA DB.
SAP HANA Application Cloud refers to the cloud-based infrastructure for delivery of applications (typically existing SAP applications rewritten to run on HANA).
HANA DB takes advantage of the low cost of main memory (RAM), the data-processing abilities of multi-core processors, and the fast data access of solid-state drives relative to traditional hard drives to deliver better performance for analytical and transactional applications.
It offers a multi-engine query processing environment that supports relational data (with both row- and column-oriented physical representations in a hybrid engine) as well as graph and text processing for semi-structured and unstructured data management within the same system.
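Why a hybrid row/column engine matters can be shown with a small sketch: the same table in row layout (good for fetching whole records transactionally) versus column layout (good for scanning a single attribute analytically). The data is illustrative, and this shows the general idea rather than HANA's implementation.

    rows = [
        {"id": 1, "region": "EU", "revenue": 120.0},
        {"id": 2, "region": "US", "revenue": 340.0},
        {"id": 3, "region": "EU", "revenue": 75.0},
    ]

    # Row layout: a transactional point lookup touches one contiguous record.
    record = rows[1]

    # Column layout: an analytical aggregate touches only the 'revenue'
    # column, skipping the bytes of every other attribute.
    columns = {key: [r[key] for r in rows] for key in rows[0]}
    print(sum(columns["revenue"]))   # 535.0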
Microsoft’s Big Data solution unleashes actionable insights for everyone from all their data through familiar tools. It also enables customers to uncover new insights by connecting to the world’s data through an open and flexible platform.
For customers with large or diverse datasets, Microsoft’s Big Data solution unleashes actionable business insights to drive smarter decisions from structured, semi-structured and unstructured data. Unlike the competition, it offers insights to everyone through integration with familiar Microsoft tools such as Excel, PowerPivot and Power View. In addition it enables customers to discover new insights by connecting to publicly available data and services from Azure Marketplace and social media sites such as Twitter and Facebook. Microsoft Big Data offers an Enterprise-ready Hadoop distribution through integration with key Microsoft components including Active Directory and System Center, and an open platform with full compatibility with Apache Hadoop APIs.
The Informatica Platform is a comprehensive, open, unified, and economical data integration platform that enables organizations to maximize their return on data by increasing the value of data while lowering its cost.
Oracle is uniquely qualified to combine everything needed to meet the big data challenge, including software and hardware, into one engineered system. The Oracle Big Data Appliance combines optimized hardware with a comprehensive software stack, featuring specialized solutions developed by Oracle, to deliver a complete, easy-to-deploy solution for acquiring, organizing, and loading big data into Oracle Database 11g. It is designed to deliver extreme analytics on all data types, with enterprise-class performance, availability, supportability, and security. With Big Data Connectors, the solution is tightly integrated with Oracle Exadata and Oracle Database, so you can analyze all your data together with extreme performance.
Once data has been loaded from Oracle Big Data Appliance into Oracle Database or Oracle Exadata, end users can use one of the following easy-to-use tools for in-database, advanced analytics:
Oracle R Enterprise – Oracle's version of the widely used Project R statistical environment enables statisticians to use R on very large data sets without any change to the end-user experience. Examples of R usage include predicting airline delays at particular airports and preparing clinical trial analyses and results for submission.
In-Database Data Mining – the ability to create complex models and deploy them on very large data volumes to drive predictive analytics. End users can leverage the results of these predictive models in their BI tools without needing to know how to build the models. For example, regression models can be used to predict customer age based on purchasing behavior and demographic data.
In-Database Text Mining – the ability to mine text from micro-blogs, CRM system comment fields, and review sites by combining Oracle Text and Oracle Data Mining. An example of text mining is sentiment analysis based on comments: sentiment analysis tries to show how customers feel about certain companies, products, or activities.
In-Database Semantic Analysis – the ability to create graphs and connections between various data points and data sets. Semantic analysis can, for example, build networks of relationships that determine the value of a customer's circle of friends. When looking at customer churn, customer value is then based on the value of the customer's network, rather than just on the value of the customer alone.
In-Database Spatial – the ability to add a spatial dimension to data and show data plotted on a map. This enables end users to understand geospatial relationships and trends much more efficiently. For example, spatial data can visualize a network of people and their geographical proximity. Customers in close proximity can readily influence each other's purchasing behavior, an opportunity easily missed if spatial visualization is left out.
In-Database MapReduce – the ability to write procedural logic and seamlessly leverage Oracle Database parallel execution, which can be exposed via SQL. In-database MapReduce allows data scientists to create high-performance routines with complex logic. Examples of leveraging in-database MapReduce are sessionization of weblogs or organization of Call Detail Records (CDRs); a sketch of the sessionization logic follows this list.
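Sessionization, the routine named above, splits one visitor's stream of hits into sessions wherever the gap between hits exceeds a threshold. A minimal Python sketch of the logic (the 30-minute cutoff, log format, and data are illustrative; Oracle's in-database MapReduce would run equivalent procedural logic inside the database):

    from itertools import groupby

    SESSION_GAP = 30 * 60   # 30 minutes, in seconds
    # (visitor, unix_timestamp) pairs, pre-sorted by visitor then time
    hits = [("alice", 1000), ("alice", 1500), ("alice", 4000), ("bob", 1200)]

    def sessionize(hits):
        for visitor, group in groupby(hits, key=lambda h: h[0]):
            session, last = [], None
            for _, ts in group:
                if last is not None and ts - last > SESSION_GAP:
                    yield visitor, session       # gap too large: close session
                    session = []
                session.append(ts)
                last = ts
            yield visitor, session               # close the final session

    for visitor, session in sessionize(hits):
        print(visitor, session)
    # alice [1000, 1500]
    # alice [4000]
    # bob [1200]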
IBM’s integrated big data platform has four core capabilities: Hadoop-based analytics, stream computing, data warehousing, and information integration and governance. These are complemented by a set of supporting platform services.
Hadoop (MapReduce, where code is turned into map and reduce jobs that Hadoop runs) is great at crunching data, yet inefficient for analyzing data, because each time you add, change, or manipulate data you must stream over the entire dataset.
In most organizations data is constantly growing, changing, and being manipulated, so the time needed to analyze it keeps increasing.
As a result, for large and diverse data sets, ad-hoc analytics, or graph data structures, there must be better alternatives to Hadoop / MapReduce.
Google, whose MapReduce and Google File System papers inspired Hadoop, thought so, and architected a better, faster data-crunching ecosystem that includes Percolator, Dremel, and Pregel. Google remains one of the key innovation leaders in large-scale architecture.
Percolator is a system for incrementally processing updates to large data sets. By replacing a batch-based indexing system with an indexing system based on incremental processing using Percolator, you can significantly speed up the pipeline and reduce the time to analyze data.
Percolator's architecture provides horizontal scalability and resilience. Percolator reduced the latency between page crawling and availability in the index by a factor of 100, and it simplified the algorithm. Its big advantage is that indexing time is now proportional to the size of the page being indexed, no longer to the size of the whole existing index.
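The contrast between the two models fits in a few lines of Python. The "index" below is just a word-to-count map over a toy corpus; the point is that a batch rebuild touches every page, while an incremental update touches only the page that changed:

    from collections import Counter

    corpus = {"page1": "big data", "page2": "data stack"}

    def batch_rebuild(corpus):
        # Hadoop-style: stream over EVERY page, every time
        index = Counter()
        for text in corpus.values():
            index.update(text.split())
        return index

    def incremental_update(index, old_text, new_text):
        # Percolator-style: work proportional to the one changed page
        index.subtract(old_text.split())
        index.update(new_text.split())

    index = batch_rebuild(corpus)
    incremental_update(index, corpus["page1"], "big data layer")
    corpus["page1"] = "big data layer"
    print(index)   # same counts a full rebuild of the new corpus would give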
Dremel is designed for ad-hoc analytics. It is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees with a columnar data layout, it can run aggregation queries over trillion-row tables in seconds, roughly 100 times faster than MapReduce, and it scales to thousands of CPUs and petabytes of data.
Dremel plays a role similar to that of Pig and Hive. Yet while Hive and Pig rely on MapReduce for query execution, Dremel uses a query execution engine based on aggregator trees.
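The aggregator-tree idea can be sketched in a few lines: leaf servers compute partial aggregates over their shard of a column, and inner nodes merge partials on the way up, so no single node ever touches the whole table. Shard contents are illustrative, and a power-of-two number of leaves is assumed for brevity:

    # one column of values, split across four leaf servers
    shards = [[3, 1, 4], [1, 5, 9], [2, 6, 5], [3, 5, 8]]

    def leaf_aggregate(shard):
        return (sum(shard), len(shard))     # partial SUM and COUNT

    def merge(a, b):
        return (a[0] + b[0], a[1] + b[1])   # inner node combines partials

    partials = [leaf_aggregate(s) for s in shards]
    while len(partials) > 1:                # fold pairs up the tree
        partials = [merge(partials[i], partials[i + 1])
                    for i in range(0, len(partials), 2)]

    total, count = partials[0]
    print(total / count)                    # AVG over all shards: 4.333...

Note how only SUM and COUNT travel up the tree; the AVG is computed once at the root, which is why aggregation queries parallelize so well in this design.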
Pregel is a system for large-scale graph processing and graph data analysis. Pregel is designed to execute graph algorithms faster and with simpler code: it computes over large graphs much faster than alternatives, and its application programming interface is easy to use.
Pregel is architected for efficient, scalable and fault-tolerant implementation on clusters of thousands of commodity computers, and its implied synchronicity makes reasoning about programs easier. Distribution-related details are hidden behind an abstract API. The result is a framework for processing large graphs that is expressive and easy to program.
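A single-machine simulation of Pregel's vertex-centric model shows the shape of the API: in each superstep, every vertex processes its incoming messages, updates its state, and sends messages to its neighbors, and the computation halts when no messages remain. This sketch computes single-source shortest paths over an illustrative graph:

    graph = {"A": {"B": 1, "C": 4}, "B": {"C": 2}, "C": {}}
    dist = {v: float("inf") for v in graph}
    messages = {"A": [0]}                   # superstep 0: seed the source

    while messages:                         # halt when no messages remain
        next_messages = {}
        for vertex, incoming in messages.items():
            best = min(incoming)
            if best < dist[vertex]:         # vertex state improves
                dist[vertex] = best
                for neighbor, weight in graph[vertex].items():
                    next_messages.setdefault(neighbor, []).append(best + weight)
        messages = next_messages            # the barrier between supersteps

    print(dist)                             # {'A': 0, 'B': 1, 'C': 3}

The synchronous barrier between supersteps is the "implied synchronicity" mentioned above: within a superstep all vertices see only messages from the previous superstep, which makes program behavior easy to reason about.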